A.  INTRODUCTION:


We  hypothesized  that  many  of  the  PC  disease-associated  SNPs  already  identified  to  date  will  be  located 
in  regulatory  domains  involved  in  gene  transcription.  Furthermore,  we  hypothesized  that  candidate  genes 
affected  by  these  regulatory  elements  could  be  identified  by  taking  advantage  of  eQTL  datasets.  Therefore,  the 
objectives  of  this  grant  proposal  were  to:  1)  construct  a  prostate  tissue-specific  eQTL  dataset  that  could  be 
used  to  identify  candidate  genes  for  any  current  (or  future),  predictive  (or  prognostic)  SNP  identified  for  PC; 
and  2)  utilize  this  dataset  to  identify  candidate  genes  for  existing  PC  risk  SNPs  that  could  then  be  followed  up 
in  future  studies.  To  accomplish  this  goal,  we  proposed  to  perform  a  genome-wide  SNP  analysis  (using  the 
Illumina Human Omni 2.5M SNP array) and a genome-wide mRNA expression analysis (using RNA
sequencing) on a common set of 500 samples of normal prostate tissue sampled from men with PC. The
long-term objective of this strategy is to characterize the functional role of the disease-causing SNPs, to identify the
genes and biologic pathways affected by these inherited factors, and ultimately to identify targets for disease
prediction, risk stratification and treatment.

B.  BODY: 

Statement  of  work  originally  proposed  for  years  1  and  2: 

Task  1.  Processing  of  normal  prostate  tissue  for  RNA  purification  (months  1-9) 

1a. Cryo-section fresh-frozen tissue from ~500-600 cases (months 1-9)

1b. Create hematoxylin-eosin (H&E) stained slides from each case for review (months 1-9)

1c. Review of sections by a Pathologist (months 1-9)

1d. Select 500 cases of high-quality samples for RNA extraction (Task 2) (month 10)

Task  2.  DNA  and  RNA  Extraction  from  500  cases  for  study  (months  11-12) 

2a.  Use  sections  from  500  samples  selected  from  Task  1  to  purify  DNA  and  total  RNA  (months  11-12) 

Task  3.  Genome-wide  genotyping  of  blood  DNA  from  500  cases  for  study  (months  12-14) 

3a.  Place  blood  DNA  (already  extracted)  in  96-well  plates  for  genotyping  (months  12) 

3b.  Genotype  samples  (months  12-14) 

3c.  Quality-control  checks  and  data  processing  -  Statistical  analyses  (months  14) 

Task  4.  Genome-wide  mRNA  profiling  of  tissue  RNA  from  500  cases  for  study  (months  13-15) 

4a.  Place  RNA  in  96-well  plates  for  expression  analysis  (months  13) 

4b.  Perform  expression  analysis  (months  13-14) 

4c.  Quality-control  checks  and  data  processing  -  Statistical  analyses  (months  15) 

Task  5.  Create  eQTL  dataset  -  Statistical  analysis  (months  16-24) 

5a.  Test  PC  risk-SNPs  for  their  association  with  transcript  level  for  all  mRNAs  utilizing  data  from  Tasks  3 
and  4  (months  16-18) 

5b.  Test  candidate  target  gene  for  association  with  all  other  SNPs  (months  18-21) 

5c.  Prepare  data  for  public  distribution  (months  21-24) 

Work  performed:  Task  1  (Processing  of  normal  prostate  tissue  for  RNA  purification) 

All  of  the  work  proposed  for  Task  1  has  been  completed. 

In order to achieve our goal of 500 samples of normal prostate tissue, we initially reviewed H&E stained
sections from all archived cases available for study (~4,000). These ~4,000 cases were obtained from patients
who had undergone a radical prostatectomy at Mayo Clinic and were available to investigators through the
Prostate  Cancer  SPORE.  Typically,  one  to  three  pieces  of  frozen  tissue  (snap  frozen  at  the  time  of  surgery) 
were  available  for  each  case.  At  the  time  each  case  was  initially  processed,  a  representative  H&E  stained  slide 
was  made  from  each  piece  of  tissue  and  archived  for  future  investigator  review  to  aid  in  the  process  of  tissue 
selection.  Although  the  archived  slide  allows  for  an  initial  evaluation,  blocks  are  used  over  time  and  the 
histology  can  change.  Thus,  cutting  an  additional  representative  H&E  is  often  necessary  to  re-evaluate  the 
current  state  of  these  blocks. 

For  this  study,  the  same  Pathologist  was  used  throughout  the  evaluation  process  to  ensure  consistency. 
In our initial pre-screen of the ~4,000 normal tissue cases, we first removed all cases where the patient’s tumor
had a Gleason score greater than 7, cases where tumor was found on the H&E slide and cases where normal
prostate tissue was not available. Following this initial review, 916 pieces of tissue were available for further
processing.  The  archived  tissue  was  then  pulled  from  long-term  storage  and  a  fresh  representative  H&E 
stained  slide  was  prepared  for  re-evaluation  by  a  Pathologist.  In  order  to  meet  the  needs  of  this  study,  the 
following  criteria  were  developed  for  further  tissue  selection  and  processing: 

1.  No  tumor  present  on  the  new  H&E. 

2.  The  section  viewed  had  to  be  from  the  posterior  region  of  the  prostate  -  all  central  and  anterior  zone 
tissues  were  eliminated.  The  region  of  interest  was  determined  based  on  histologic  landmarks  and 
Mayo practice processes (posterior regions are inked for orientation).

3.  No  High-Grade  Prostatic  Intraepithelial  Neoplasia  (HGPIN). 

4. No more than 1% of the cells on the slide could be lymphocytes.

5.  The  final  percent  of  epithelial  glands  present  on  the  slide  had  to  be  at  least  40%. 

Of  the  916  cases  re-examined,  93  cases  met  the  criteria  above,  but  also  contained  Benign  Prostatic 
Hyperplasia  (BPH),  seminal  vesicle,  urethra,  or  adjacent  central  zone.  These  pieces  of  tissue  were  further 
processed  to  eliminate  the  contaminating  portion  and  an  additional  H&E  stained  section  was  prepared  to 
ensure  that  the  block  was  processed  correctly  and  the  unwanted  regions  were  adequately  removed. 

Following  the  final  review  of  tissue,  565  cases  met  the  selection  criteria  noted  above.  Due  to  the  small 
number of cases meeting our strict histologic criteria (565 of ~4,000 cases reviewed), most of the selected
cases  did  not  have  blood  available  for  the  extraction  of  DNA  (for  genotyping).  As  a  result,  we  chose  to  take 
additional  sections  of  the  normal  prostate  tissue,  which  allowed  for  the  extraction  of  both  RNA  (expression)  and 
DNA  (genotyping).  From  past  experience,  we  expected  that  a  degree  of  histologic  change  would  be  present 
throughout the sectioning process and this would result in an additional ~10% of the cases failing to meet our
selection  criteria.  Thus,  we  decided  to  section  and  evaluate  all  565  cases,  re-evaluate  H&E  stained  sections 
once  more  and  then  choose  the  best  cases  for  the  final  processing. 

Work  performed:  Task  2  (DNA  and  RNA  Extraction  from  500  cases  for  study) 

All  of  the  work  proposed  for  Task  2  has  been  completed. 

For  the  extraction  of  DNA  and  RNA,  tissue  was  first  sectioned  on  a  cryostat,  preparing  10-micron  thick 
sections.  Prior  to  sectioning,  however,  all  of  the  samples  were  randomized  into  cutting  groups  based  on 
percent  epithelium,  presence  or  absence  of  lymphocytes,  the  time  of  original  tissue  collection,  and  if  the  tissue 
came  from  prostate  cancer  patients  or  from  patients  having  a  cysto-prostatectomy  due  to  bladder  cancer.  The 
randomization  of  samples  was  performed  in  order  to  control  for  any  cutting  bias  that  might  be  introduced  as  the 
tissue  was  processed  each  day.  The  565  cases  were  sectioned  over  a  period  of  26  working  days  in  the 
following  manner:  the  initial  section  was  taken  for  an  H&E  stained  slide  (to  serve  as  a  one-to-one  comparison 
with  the  initially  reviewed  H&E  section  to  confirm  that  no  tissue  mix-up  had  occurred),  then  multiple  sections 
placed  in  tube  1  for  RNA,  a  2nd  H&E  section,  multiple  sections  placed  in  tube  2  for  RNA,  3rd  H&E  section, 
multiple  sections  placed  in  tube  3  for  DNA,  4th  H&E  section,  multiple  sections  placed  in  tube  4  for  DNA,  and  the 
final  H&E  section.  For  the  RNA-destined  tubes,  tissue  was  immediately  placed  in  QIAzol  buffer  and  then  snap 
frozen  to  ensure  high-quality  RNA.  For  the  DNA-destined  tubes,  sections  were  placed  in  tubes  and  initially 
stored  at  -80°  C.  These  tubes  were  then  collected  the  following  day,  and  QIAgen  Gentra  Puregene  cell  lysis 
buffer  and  proteinase  K  were  added  to  both  DNA  tubes  and  digested  overnight  at  55°  C  on  a  shaking  incubator 
essentially  as  outlined  by  the  manufacturer.  Visual  confirmation  was  done  the  following  day  to  ensure  all  of  the 
tissue  was  digested.  The  tubes  were  then  considered  stable  and  stored  at  4°  C  pending  completion  of  the  DNA 
extraction. 

All five H&E sections outlined above were evaluated once again by a Pathologist to ensure that no
histologic  changes  had  occurred  as  the  tissue  was  sectioned.  Additionally,  the  1st  H&E  was  used  to  compare  to 
the  original  H&E  confirming  that  no  specimen  mix-ups  had  occurred.  Upon  histologic  review  of  all  five  H&E 
slides,  roughly  10%  of  the  cases  were  eliminated  due  to  histologic  changes  (i.e.  the  appearance  of  small 
cancer  foci,  change  in  percent  epithelium,  appearance  of  HGPIN,  an  increase  in  lymphocytic  presence). 
Following  this  final  review,  505  cases  remained  that  met  the  initial  criteria.  Again,  because  we  anticipated  that 
there  would  be  a  small  number  of  cases  having  poor-quality  RNA  or  poor-DNA  yield,  an  additional  19  cases 
were selected that had 2% infiltrative lymphocytes present for the final process of DNA and RNA extraction.
These  524  cases  were  then  split  into  two  batches  for  RNA  extraction  and  re-randomized  again  as  previously 
described,  but  now  the  randomization  scheme  also  included  the  day  the  tissue  was  processed.  This 
randomization  was  performed  to  avoid  any  batch  effects  during  RNA  extraction. 
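
The randomization described above can be illustrated with a minimal sketch. The record fields (`high_epithelium`, `source`), the stratum definition and the round-robin dealing are assumptions made for illustration only; the report does not specify the exact scheme that was used.

```python
import random
from collections import defaultdict

def assign_batches(samples, n_batches, strata_key, seed=0):
    """Shuffle samples within each stratum, then deal them round-robin into batches.

    This spreads every stratum (e.g. epithelium level, tissue source) as evenly
    as possible across batches, so no batch is dominated by one kind of sample.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        strata[strata_key(sample)].append(sample)

    batches = [[] for _ in range(n_batches)]
    position = 0
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)          # random order within the stratum
        for sample in members:
            batches[position % n_batches].append(sample)
            position += 1
    return batches

# Hypothetical sample records (not data from the study).
samples = [
    {"id": k, "high_epithelium": k % 2 == 0, "source": "PC" if k % 10 else "cysto"}
    for k in range(40)
]
batches = assign_batches(
    samples, 4, lambda s: (s["high_epithelium"], s["source"])
)
```

Additional factors such as the processing day would simply be added to the tuple returned by the stratum key.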

DNA  was  extracted  by  first  performing  a  protein  precipitation  step  (Qiagen  protein  precipitation  solution), 
followed by an isopropanol then ethanol rinse. The DNA pellet was allowed to dry, then dissolved in TE and
allowed to mix overnight. After mixing, DNA was quantified using a NanoDrop, and concentrations were
standardized. Total RNA was extracted using the RNeasy Mini Kit (Qiagen) according to the
manufacturer’s instructions on the QIAcube. RNA was then assessed for quality using Agilent chip
technology. Cases having an RNA integrity number (RIN) of 7.0 or greater were considered good quality. Once completed, the
optimum set of 500 samples was then selected for the mRNA expression and DNA genotyping studies based
on RNA and DNA quality and those samples meeting the strictest selection criteria (i.e., higher percent
epithelium,  no  or  fewest  lymphocytes  present).  Following  this  initial  selection,  six  samples  were  later  omitted 
because  they  were  found  to  not  meet  the  original  criteria  for  the  grade  of  tumor  (Gleason  score  of  7  or  less). 

Work  performed:  Task  3  (Genome-wide  genotyping  of  blood  DNA  from  500  cases  for  study) 

All  of  the  work  proposed  for  Task  3  has  been  completed. 

As originally proposed, DNA from 500 tissue samples was selected and randomized to 96-well plates
with two CEPH controls on each plate. Samples were then genotyped using the Illumina Human Omni 2.5M
SNP  array.  These  studies  along  with  the  QC  analyses  to  identify  sample  and/or  SNP  quality  issues  have  been 
completed. 

QC  analyses  included  the  evaluation  of  call-rates,  minor  allele  frequencies,  and  tests  of  Hardy-Weinberg 
Equilibrium  (HWE)  for  each  of  the  SNPs.  The  QC  filters  that  were  applied  to  the  genotypic  data  include 
excluding  SNPs  with:  1)  call-rate  <  95%;  2)  MAF  <  1%;  3)  HWE  p-value  <  1e-4;  4)  concordance  in  duplicates  < 
99.5%;  and  5)  unknown  physical  position  based  on  current  genome  build.  In  addition,  we  estimated  the 
genotyping  error  rates  by  checking  for  Mendelian  consistency  and  duplicate  concordance  rates  using  CEPH 
controls.  Finally,  we  tested  for  potential  batch  effects  by  testing  for  allele  frequency  and  call  rate  differences 
across  plates.  Subject  level  QC  included  calculation  of  call-rates,  sex  determination,  as  well  as  calculation  of 
pairwise identity-by-descent probabilities for all pairs of subjects in order to identify and remove related
subjects. See Appendices 1 and 2 for the complete QC report. Appendix 1 includes information for all SNPs and all
samples. Appendix 2 provides information after excluding problematic SNPs and problematic samples and
includes additional QC tests.
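
As an illustration of the SNP-level filters listed above, the following minimal sketch applies the call-rate, MAF and HWE thresholds to the genotype counts of a single SNP. The function names are hypothetical, and the use of a one-degree-of-freedom chi-square test for HWE is an assumption; the report does not state which HWE test was used.

```python
import math

def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) test of Hardy-Weinberg proportions from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0                       # monomorphic SNP: test is not informative
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def passes_snp_qc(n_aa, n_ab, n_bb, n_missing,
                  min_call_rate=0.95, min_maf=0.01, min_hwe_p=1e-4):
    """Apply the call-rate, MAF and HWE filters described in the text."""
    called = n_aa + n_ab + n_bb
    total = called + n_missing
    if called == 0:
        return False                     # SNP failed on all samples
    call_rate = called / total
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1 - freq_a)
    return (call_rate >= min_call_rate
            and maf >= min_maf
            and hwe_p_value(n_aa, n_ab, n_bb) >= min_hwe_p)
```

The duplicate-concordance and genome-position filters are omitted because they need information beyond per-SNP genotype counts.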

Overall, the quality of the 2.5M SNP genotyping data is excellent. A total of 17 of 494 samples were
flagged for QC reasons; 5 samples had a SNP call rate < 95%, 10 are non-Caucasian (5 African, 5 Asian) and
2 subjects appear to be first cousins. After excluding the low call-rate samples, the non-Caucasian samples and
one member of the related pair, we have 478 unrelated, Caucasian samples remaining for analysis. SNP
exclusions are summarized below. We have ~1.5M QC-passed SNPs with MAF >= 1% available for analysis.

Sample exclusions: 494 samples

    5 call rate < 95%
    10 non-Caucasian (5 African; 5 Asian)
    1 member of a related pair

Samples remaining: 478

SNP exclusions: 2,372,617 SNPs are on the 2.5M array

    6,409 call rate < 95% (205 failed completely)
    454,736 monomorphic
    902 HWE p-value < 1e-5 (276 with p < 1e-10)

SNPs remaining: 1,910,570
SNPs with MAF > 1%: 1,558,636
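
The exclusion counts above can be checked with simple arithmetic. The sketch below assumes that the listed exclusion categories do not overlap, which is consistent with the reported totals.

```python
# Sample-level exclusions reported in the summary.
total_samples = 494
sample_exclusions = {
    "call rate < 95%": 5,
    "non-Caucasian": 10,
    "one member of related pair": 1,
}
samples_remaining = total_samples - sum(sample_exclusions.values())

# SNP-level exclusions reported in the summary.
total_snps = 2_372_617
snp_exclusions = {
    "call rate < 95%": 6_409,
    "monomorphic": 454_736,
    "HWE p-value < 1e-5": 902,
}
snps_remaining = total_snps - sum(snp_exclusions.values())

print(samples_remaining)  # 478
print(snps_remaining)     # 1910570
```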

Work  performed:  Task  4  (Genome-wide  mRNA  profiling  of  tissue  RNA  from  500  cases  for  study) 

All  of  the  work  proposed  for  Task  4  has  been  completed. 

In the original statement of work, we had proposed the use of the Illumina HumanHT-12 BeadChip as the
platform to derive the genome-wide mRNA expression dataset. However, the cost of next generation
sequencing (NGS) dropped dramatically over the course of our project and, as a result, we explored the option
of performing RNA profiling by NGS (RNAseq). The use of RNAseq significantly increased both the quality and
value  of  this  dataset.  We  were  able  to  obtain  additional  funds  to  supplement  the  DOD  award  to  perform  these 
experiments,  and  following  approval  by  the  Scientific  Officer,  we  changed  our  approach  for  this  task  to  RNA 
sequencing.  To  accomplish  the  work  proposed,  we  utilized  the  Agilent  SureSelect  RNA  capture  kit  for  the  RNA 
library preparation and the Illumina HiSeq 2000 for the RNA sequencing. For these experiments, samples were
first  randomized  to  library-prep  groups.  The  randomization  was  performed  as  previously  described,  but  now  the 
randomization  scheme  included  both  the  day  the  tissue  was  processed  and  the  RNA  extraction  group.  This 
randomization  was  performed  to  avoid  any  batch  effects  during  sequencing.  Samples  were  indexed  such  that 
five samples were analyzed in a single lane. Our goal was to achieve a minimum of 50 million reads per
sample, and this has been accomplished.

The  first-phase  Bioinformatic  analysis  was  completed  using  an  in-house-developed  pipeline,  MAP-RSeq. 
MAP-RSeq  is  a  comprehensive  computational  pipeline  for  secondary  analysis  of  RNA-Sequencing  data.  MAP- 
RSeq  uses  a  variety  of  freely  available  bioinformatics  tools  along  with  in-house-developed  methods.  Alignment 
and  mapping  of  the  reads  was  performed  using  Bowtie  (http://bowtie-bio.sourceforge.net/index.shtml)  and 
TopHat  (http://tophat.cbcb.umd.edu/)  software.  Gene  counts  were  generated  using  HTseq  software 
(http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html)  and  gene  annotation  files  were  obtained 
from Illumina (http://cufflinks.cbcb.umd.edu/igenomes.html). For single nucleotide variant (SNV) calling, we
used the GATK (http://www.broadinstitute.org/gatk/) software. SNVs were further annotated and filtered for
quality, coverage and other criteria using the variant quality score recalibration (VQSR) method. MAP-RSeq also
provides a list of expressed fusion transcripts using the TopHat-Fusion algorithm. All of the bioinformatics analysis
using  MAP-RSeq  has  now  been  completed. 

As with the genotype data, QC assessment of the RNAseq data is also complete. We compared
RNA-called genotypes to genotypes from the Illumina Human Omni 2.5M array to test for sample mix-ups. To
investigate  factors  that  may  influence  the  number  of  counts  observed,  we  summarized  the  log2(gene  counts) 
and  the  percentage  of  counts  >  0  by  subject,  lane,  flowcells,  %GC  content  per  gene  and  by  gene  size 
(counting  only  the  sum  of  the  exons).  Data  quality  was  assessed  via  per-specimen  box  plots  and  minus  versus 
average  (MVA)  plots.  The  box  plots  were  sorted  by  various  experimental  factors,  e.g.,  batch  and  run  order  in 
order  to  examine  global  shifts  in  counts  due  to  these  factors.  The  existence  of  and  functional  form  of  biases 
between  specimens  were  assessed  via  residual  MVA  plots.  The  modified  MVA  plot  uses  a  linear  model  to 
examine  trends  in  residuals.  A  detailed  description  with  examples  of  the  QC  analyses  performed  is  provided  in 
Appendix  3.  Overall,  the  quality  of  the  RNAseq  data  was  excellent. 
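
For readers unfamiliar with MVA plots, the quantities plotted can be sketched as follows. For a pair of specimens, M is the difference and A the average of the log2 gene counts. The pseudocount added before taking logarithms is an assumption for handling zero counts and is not taken from the report; the residual (linear-model) variant described above is not shown.

```python
import math

def mva_values(counts_x, counts_y, pseudocount=1.0):
    """Return a list of (A, M) pairs for two specimens' gene counts.

    A = mean of the two log2 counts (x-axis of an MVA plot)
    M = difference of the two log2 counts (y-axis of an MVA plot)
    """
    pairs = []
    for x, y in zip(counts_x, counts_y):
        log_x = math.log2(x + pseudocount)
        log_y = math.log2(y + pseudocount)
        pairs.append(((log_x + log_y) / 2, log_x - log_y))
    return pairs

# Hypothetical counts for two genes in two specimens.
print(mva_values([3, 7], [3, 1]))  # [(2.0, 0.0), (2.0, 2.0)]
```

With unbiased data the M values scatter around zero across the whole range of A; a trend in M against A indicates an intensity-dependent bias between the two specimens.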

In addition, a manual review of several bioinformatically generated sample-specific RNAseq parameters
(Figure  1)  was  conducted  for  each  sample.  These  include  the  following:  junction  saturation  (Figure  2  A); 
splice  junctions  (Figure  2  B);  inner  distance  (Figure  2  C);  read  duplication;  and  gene  body  coverage  (Figure  2 
D).  Figure  1  shows  data  for  five  representative  samples,  while  Figure  2  shows  data  for  two  samples,  one  with 
acceptable  data  (left)  and  one  with  unacceptable  data  (right).  From  these  analyses,  eight  samples  were 
flagged  as  potentially  problematic. 


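Figure 1: Sample-specific RNAseq parameters for five representative samples.
Figure 2: (A) junction saturation; (B) splice junctions; (C) inner distance; (D) gene body coverage. Left: sample with acceptable data; right: sample with unacceptable data.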


Work  performed:  Task  5  (Create  eQTL  dataset) 

All  of  the  work  proposed  for  Task  5  has  been  completed. 

For  the  eQTL  dataset,  we  are  interested  in  both  coding  (as  originally  planned)  as  well  as  newly  described 
long  intergenic  non-coding  RNA  (lincRNA).  The  standard  pipeline  described  above  provides  a  description  of  all 
of  the  coding  transcripts,  but  not  for  lincRNAs.  As  a  result,  we  developed  a  pipeline  to  identify,  quantify  and 
annotate  lincRNA  and  have  applied  this  to  our  RNAseq  data.  These  analyses  have  been  completed. 

The  pipeline  consists  of  several  modules: 

1) Candidate transcript assembly module: this module uses a genome-guided strategy for
transcriptome reconstruction. The aligned BAM files (i.e., BAM files from TopHat) were assembled with
Cufflinks 2.0.2. The option “Reference Annotation Based Transcript” (RABT) assembly was used
because of its advantage in identifying novel transcripts. GENCODE V16 was used as the annotation file
to guide the transcript assembly process.

2)  LincRNA  identification  module:  this  module  aimed  to  identify  and  report  expressed  lincRNAs  in  the 
RNAseq  data.  To  achieve  this,  five  filtering  steps  were  used  as  follows. 

a.  Size  restriction:  transcripts  smaller  than  200  nt  were  removed. 

b.  Removal  of  known  protein-coding  regions:  candidate  transcripts  that  overlap  with  transcripts 
in the “protein-coding” category in GENCODE V16 were removed.

c. Removal of transcripts homologous to known proteins: the blastx program was used to
evaluate the similarity between candidate transcripts and known proteins in the RefSeq
database (protein with NM_ prefix). Transcripts with an E value less than 1e-4 were removed.

d.  Removal  of  transcripts  predicted  to  code  for  proteins:  the  candidate  transcripts  were  then 
assessed  for  their  coding  potential  by  the  CPAT  tool,  an  in  silico  computational  model 
classifying  coding  and  non-coding  transcripts.  Specifically,  a  logistic  regression  model  was  built 
based  on  four  sequence  features,  including  open  reading  frame  size,  open  reading  frame 
coverage,  Fickett  TESTCODE  statistic  and  hexamer  usage  bias.  A  training  dataset  was 
constructed containing both known protein-coding (NM_ prefix in RefSeq database) and non-coding
transcripts. Compared to other widely used tools such as CPC and PhyloCSF, CPAT
has higher sensitivity and specificity (>0.966), and is much faster (i.e., it processes thousands of
transcripts within seconds).

e. Known protein domain filter: the remaining candidate transcripts were then evaluated for whether
they contain a known protein-coding domain. To achieve this, each candidate transcript was
translated  in  all  three  reading  frames  and  compared  against  13,672  known  protein  family 
domains  documented  in  the  Pfam  database  Version  26  by  the  HMMER-3  tool.  HMMER-3  uses 
hidden  Markov  models  (HMMs)  to  scan  each  amino  acid  sequence  and  classify  whether  it 
resembles  any  of  the  known  domains  in  the  database.  Candidate  transcripts  with  a  significant 
Pfam  hit  (P  value  less  than  1e-5)  were  excluded. 
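
The five filtering steps above can be summarized as a simple cascade. This is a minimal sketch rather than the pipeline itself: the `Candidate` record and its field names are hypothetical, the per-transcript values are assumed to have been computed beforehand by the tools named in the text, and the CPAT coding-probability cutoff is left as a parameter because the report does not state the value used.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    name: str
    length_nt: int
    overlaps_protein_coding: bool   # overlaps a GENCODE "protein-coding" transcript
    blastx_evalue: Optional[float]  # best blastx hit against RefSeq; None if no hit
    coding_probability: float       # coding probability reported by CPAT
    pfam_p_value: Optional[float]   # best Pfam hit from HMMER; None if no hit

def is_lincrna(c: Candidate, cpat_cutoff: float) -> bool:
    """Return True if the candidate survives all five filters (a to e)."""
    if c.length_nt < 200:                                          # a. size restriction
        return False
    if c.overlaps_protein_coding:                                  # b. known coding regions
        return False
    if c.blastx_evalue is not None and c.blastx_evalue < 1e-4:     # c. protein homology
        return False
    if c.coding_probability >= cpat_cutoff:                        # d. predicted coding
        return False
    if c.pfam_p_value is not None and c.pfam_p_value < 1e-5:       # e. known domains
        return False
    return True

# Hypothetical candidates; the cutoff value here is an assumed example.
kept = Candidate("t1", 850, False, None, 0.05, None)
too_short = Candidate("t2", 150, False, None, 0.05, None)
has_domain = Candidate("t3", 900, False, None, 0.05, 1e-8)
print([is_lincrna(c, cpat_cutoff=0.364) for c in (kept, too_short, has_domain)])
```

Ordering the cheap checks (length, annotation overlap) before the expensive ones (blastx, HMMER) means fewer transcripts reach the slower tools.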

In  total,  we  identified  72,740  candidate  lincRNA  transcripts  at  38,899  intergenic  loci  in  494  normal 
prostate tissue samples. Significant overlap was observed between these transcripts and
lincRNAs annotated in GENCODE V17, i.e., 63% of lincRNAs annotated in GENCODE V17 were also
identified in our dataset. These prostate-derived lincRNAs were further examined for evidence of
transcriptional activity using the H3K4me3-H3K36me3 domains generated from nine cell lines in the ENCODE
project. Overall, 18,368 lincRNAs (~25%) have evidence of a signature consistent with an actively transcribed
gene across the entire locus (both H3K4me3 across the promoter region and H3K36me3 along the transcribed
region). Of the remaining transcripts, 7,849 (11%) overlap an H3K4me3 peak alone (promoter region) and
6,856  (9%)  overlap  an  H3K36me3  peak  alone  (transcribed  region).  A  manuscript  describing  the  lincRNA 
work  is  now  in  preparation. 

eQTL  Analysis.  As  noted  above,  genome-wide  genotypes  and  genome-wide  mRNA  expression  levels 
were obtained with the use of the Illumina Human Omni 2.5M SNP array and by RNA sequencing,
respectively.  Following  extensive  QC,  our  final  dataset  consisted  of:  a)  471  normal  prostate  tissue  samples 
(453  from  low  Gleason  grade  PC  cases  and  18  from  Cystoprostatectomy  cases);  b)  1,542,229  SNPs;  and  c) 
17,252  expressed  genes. 

For  PC,  multiple  GWAS  and  confirmatory  studies  have  provided  a  substantial  number  of  well-validated 
SNPs (~146) that are associated with an increased risk of developing PC (Table 1). Our primary analysis
focused on identifying eQTLs for these 146 PC risk-SNPs, including all SNPs in linkage disequilibrium with
each risk-SNP (r2 > 0.5), resulting in a total of 6,324 SNPs to be evaluated in 100 unique risk-intervals. The
number  of  SNPs  evaluated  for  each  of  the  risk  regions  is  shown  in  Table  2. 
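
To make the r2 > 0.5 criterion concrete, the sketch below estimates r2 between two SNPs as the squared Pearson correlation of their genotype dosages (0, 1 or 2 copies of an allele). This genotype-based estimate is an assumption chosen for simplicity; the report does not state how linkage disequilibrium was calculated or which reference panel was used.

```python
def genotype_r2(dosages_1, dosages_2):
    """Squared Pearson correlation between two SNPs' genotype dosages."""
    n = len(dosages_1)
    mean_1 = sum(dosages_1) / n
    mean_2 = sum(dosages_2) / n
    cov = sum((a - mean_1) * (b - mean_2) for a, b in zip(dosages_1, dosages_2))
    var_1 = sum((a - mean_1) ** 2 for a in dosages_1)
    var_2 = sum((b - mean_2) ** 2 for b in dosages_2)
    if var_1 == 0 or var_2 == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_1 * var_2)

def in_ld(dosages_1, dosages_2, threshold=0.5):
    """True if the pair exceeds the r2 threshold used to define a risk interval."""
    return genotype_r2(dosages_1, dosages_2) > threshold

# Hypothetical dosages for six subjects.
risk_snp = [0, 1, 2, 0, 1, 2]
proxy_snp = [2, 1, 0, 2, 1, 0]   # perfectly (inversely) correlated
print(in_ld(risk_snp, proxy_snp))
```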

Furthermore, we focused on cis-acting associations only, where the transcript was located within 2Mb
(+/-1Mb)  of  the  risk-SNP  interval.  A  total  of  3,142  gene  transcripts  within  these  intervals  were  identified.  Of 
these,  867  were  not  evaluated  due  to  low  or  no  expression,  leaving  2,275  for  further  analysis.  The  genes 
localized  to  each  of  these  regions  are  shown  in  Table  2. 
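
The cis-window selection can be sketched as an interval test. The coordinates below are hypothetical, and treating any overlap with the extended window as "within" the window is an assumption, since the report does not say whether a transcript had to lie entirely inside it.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    chrom: str
    start: int
    end: int

def in_cis_window(gene: Interval, risk: Interval, flank: int = 1_000_000) -> bool:
    """True if the gene overlaps the risk interval extended by `flank` bp on each side."""
    return (gene.chrom == risk.chrom
            and gene.start <= risk.end + flank
            and gene.end >= risk.start - flank)

# Hypothetical coordinates, for illustration only.
risk_interval = Interval("chr5", 1_890_000, 1_896_000)
genes = {
    "near_gene": Interval("chr5", 1_877_000, 1_887_000),
    "far_gene": Interval("chr5", 4_000_000, 4_050_000),
    "other_chrom": Interval("chr19", 1_880_000, 1_885_000),
}
cis_genes = [name for name, g in genes.items() if in_cis_window(g, risk_interval)]
print(cis_genes)  # ['near_gene']
```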

Of  the  6,324  SNPs  located  in  the  100  risk-intervals,  1,718  demonstrated  a  significant  eQTL  signal  after 
adjustment for sample histology (percent lymphocytes and percent epithelial cells) and meeting a
Bonferroni-adjusted p-value threshold of 1.96e-7 (results ranged from 1.96e-7 to 1.52e-91). Of the 100 PC risk-intervals,
31 (31%) demonstrated a significant eQTL signal and these were associated with 54 genes. Examples for two
of the significant eQTL regions of interest are shown in Appendices 4 and 5. Appendix 4 shows data for the
risk-SNP region for rs12653946 on Chromosome 5 (6 Kb region) and the associated gene identified, IRX4 (all
P-values less than 1e-40). Appendix 5 shows data for the risk-SNP region for rs8102476 on Chromosome 19
(30 Kb region) and the associated gene identified, PPP1R14A (all P-values less than 1e-20). A manuscript
describing the eQTL analysis is now in preparation.
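
The Bonferroni adjustment mentioned above divides the family-wise significance level by the number of tests. The report gives the resulting threshold (1.96e-7) but not the significance level or the number of SNP-gene tests, so the sketch below works backwards under the assumption of a family-wise alpha of 0.05; the implied test count is therefore illustrative only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold for a family-wise error rate of alpha."""
    return alpha / n_tests

def implied_number_of_tests(alpha, threshold):
    """Number of tests that would produce the given Bonferroni threshold."""
    return alpha / threshold

reported_threshold = 1.96e-7
assumed_alpha = 0.05  # assumption: not stated in the report

n_tests = round(implied_number_of_tests(assumed_alpha, reported_threshold))
print(n_tests)                                       # roughly 255,000 tests
print(bonferroni_threshold(assumed_alpha, n_tests))  # close to 1.96e-7
```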


C.  KEY  RESEARCH  ACCOMPLISHMENTS: 

•  Tissue  processing  completed. 

•  Extraction  of  tissue  RNA  and  DNA  completed. 

•  DNA genotyping of 500 samples using the Illumina Human Omni 2.5M SNP array completed.

•  RNA sequencing of 500 samples using the Agilent SureSelect RNA capture kit and the Illumina HiSeq
2000  completed. 

•  QC  assessment  of  both  Genotype  and  RNAseq  data  completed. 

•  Identified,  quantified  and  annotated  lincRNA  in  our  RNAseq  data  (manuscript  in  preparation). 

•  eQTL  dataset  constructed  (manuscript  in  preparation). 

•  eQTL  analysis  for  146  reported  risk-SNPs  completed  (manuscript  in  preparation). 

•  Identified eQTL signals for 54 risk-SNP-gene combinations.


D.  REPORTABLE  OUTCOMES: 

•  Three  manuscripts  now  in  preparation 

•  eQTL  dataset  constructed 

•  Information  from  this  DOD  grant  was  helpful  in  our  obtaining  an  NIH  award  (CA151254) 

E.  CONCLUSION: 

The  major  goal  of  this  proposal  was  to  construct  a  prostate  tissue-specific  expression  quantitative  trait 
loci (eQTL) dataset. Tissue processing, RNA and DNA purification, DNA genotyping and RNA expression
analysis, and identification of all lincRNAs for the construction of this eQTL dataset have now been completed.

We  hypothesized  that  many  of  the  PC  disease-associated  SNPs  identified  to  date  would  be  located  in 
regulatory  domains  involved  in  gene  transcription.  Furthermore,  we  hypothesized  that  candidate  genes 
affected by these regulatory elements could be identified by taking advantage of an eQTL dataset. The results
of this study provide convincing evidence that this is, in fact, the case.

Of  the  6,324  SNPs  located  in  the  100  risk-intervals,  1,718  demonstrated  a  significant  eQTL  signal  after 
adjustment  for  sample  histology  (percent  lymphocytes  and  percent  epithelial  cells)  and  meeting  a  Bonferroni- 
adjusted  p-value  threshold  of  1.96e-7  (ranged  from  1.96e-7  to  1.52e-91).  Of  the  100  PC  intervals  containing  a 
PC  risk-SNP,  31  (31%)  demonstrated  a  significant  eQTL  signal  and  these  were  associated  with  54  genes. 
Thus,  54  genes  have  now  been  identified  as  candidate  risk  genes  for  prostate  cancer.  This  is  the  largest 
number  of  candidate  susceptibility  genes  found  to  date  for  prostate  cancer. 

All  aspects  of  this  grant  proposal  have  been  completed  successfully  with  very  positive  and  exciting 
results. 


F.  REFERENCES:  None 


G.  APPENDICES: 


Appendix  1:  SNP  QC  report,  for  all  SNPs  and  all  samples. 

Appendix 2: SNP QC report after excluding problematic SNPs and problematic samples, including
additional QC tests.

Appendix  3:  mRNA  QC  report 

Appendix  4:  eQTL  analysis  for  Chromosome  5  region  of  interest 
Appendix  5:  eQTL  analysis  for  Chromosome  19  region  of  interest 


H.  SUPPORTING  DATA: 

Table  1:  List  of  PC  risk-SNPs  used  for  the  study,  including  chromosome  location 
Table  2:  Number  of  SNPs  and  number  of  genes  evaluated  for  each  of  the  risk  regions 
This document summarizes GWAS QC analysis performed on the HumanOmni2.5-4v1 chip for Prostate Cancer patients. Data are
available for 736 samples from 2,372,617 SNPs including 16 CEPH controls. This summary includes data for 510 samples and 2,372,617
SNPs including 16 controls.


2  Initial  SNP  Quality  Control 

2.1  SNP  Call  Rates 

We first look at how many SNPs drop out using different SNP call rate cutoffs. See Table 1 for the percentage of SNPs retained
as the call rate threshold increases. A total of 205 SNPs (0.009%) failed completely. Using a call rate threshold of 98%, 28,443 SNPs
(1.2%) will be dropped. Using a call rate threshold of 95%, 6,409 SNPs (0.3%) will be dropped.

2.2  Failed,  Monomorphic,  and  Low  Call  Rate  SNPs  by  Chromosome 

This section describes how many SNPs failed completely, are “monomorphic”, or have a call rate < 95% by chromosome and overall
(Table 2). First “failed” SNPs are identified, then “monomorphic”, and finally those SNPs with a call rate < 95%. The
distribution of SNP call rates by chromosome is presented in Figure 1.

2.3  Minor  Allele  Frequency 

The distribution of minor allele frequencies (MAFs) for all SNPs is shown in Figure 2. There are a total of 456,321 (19.23%)
monomorphic  SNPs  and  809,688  (34.13%)  SNPs  with  MAF  <  1%. 

2.4  Hardy  Weinberg  P-value 

This dataset does not include controls to reliably test for Hardy-Weinberg Equilibrium so the following results should be interpreted
with caution. We include only Caucasian subjects resulting in 494 independent subjects. Chromosomes X, Y, XY, and MT markers
are excluded from this summary as are SNPs that failed on all samples and SNPs with MAF < 0.05. There are 1,242 SNPs that have a
HWE p-value < 10e-05 (see Figure 3).

Figure 1: SNP Call Rates by Chromosome

Table 1: SNP Call Rates

CallRate  NumSNPsBelow  %Below  NumSNPsAbove   %Above
0.000              205   0.000       2372412  100.000
0.800             2200   0.100       2370417   99.900
0.850             2458   0.100       2370159   99.900
0.900             2906   0.100       2369711   99.900
0.910             3111   0.100       2369506   99.900
0.920             3424   0.100       2369193   99.900
0.930             3968   0.200       2368649   99.800
0.940             4877   0.200       2367740   99.800
0.950             6409   0.300       2366208   99.700
0.960             9328   0.400       2363289   99.600
0.970            14625   0.600       2357992   99.400
0.980            28443   1.200       2344174   98.800
0.990           159173   6.700       2213444   93.300
1.000           901479  38.000       1471138   62.000


Observed  p-value 


Expected  p-value 


Table 2: SNP QC Summary by Chromosome - CEPH samples excluded

                        Failed         Monomorphic     Callrate<0.95      Remaining
Chrom    TotalSNPs      N      %         N       %        N      %          N       %
1           184072     10   0.01     37394   20.31      267   0.15     146401   79.53
2           194126      8   0.00     39033   20.11      245   0.13     154840   79.76
3           163672     16   0.01     31653   19.34      193   0.12     131810   80.53
4           152846      7   0.00     28989   18.97      193   0.13     123657   80.90
5           145453      4   0.00     29638   20.38      170   0.12     115641   79.50
6           154686      7   0.00     28652   18.52      259   0.17     125768   81.31
7           129072      5   0.00     24646   19.09      209   0.16     104212   80.74
8           125515      6   0.00     23393   18.64      189   0.15     101927   81.21
9           103011      6   0.01     19384   18.82      140   0.14      83481   81.04
10          119408      8   0.01     22824   19.11      163   0.14      96413   80.74
11          116095      4   0.00     23212   19.99      192   0.17      92687   79.84
12          112722      3   0.00     22343   19.82      158   0.14      90218   80.04
13           83483      4   0.00     14950   17.91      102   0.12      68427   81.97
14           76510      6   0.01     14566   19.04      105   0.14      61833   80.82
15           72294      3   0.00     13249   18.33      104   0.14      58938   81.53
16           76610      5   0.01     13546   17.68      139   0.18      62920   82.13
17           66387      4   0.01     12459   18.77      152   0.23      53772   81.00
18           68552      5   0.01     12196   17.79       90   0.13      56261   82.07
19           47733      3   0.01      8787   18.41      131   0.27      38812   81.31
20           56542      4   0.01     10103   17.87       94   0.17      46341   81.96
21           32075      4   0.01      5604   17.47       32   0.10      26435   82.42
22           33310      3   0.01      4993   14.99      105   0.32      28209   84.69
X            55208     34   0.06     12690   22.99     1165   2.11      41319   74.84
Y             2561     46   1.80      1887   73.68       14   0.55        614   23.98
XY             418      0   0.00        49   11.72        2   0.48        367   87.80
MT             256      0   0.00        81   31.64        6   2.34        169   66.02
Overall    2372617    205   0.01    456321   19.23     4619   0.19    1911472   80.56


Table 3: Minor Allele Frequency - CEPH samples and failed SNPs excluded

MAF cutoff     Ndrop    %Drop     Nkeep    %Keep
0.001         456321   19.200   1916091   80.800
0.010         809688   34.100   1562724   65.900
0.050        1095145   46.200   1277267   53.800
0.100        1321988   55.700   1050424   44.300

3  Initial Sample Quality Control


3.1  Sample  Call  Rates 


Figure 4 (p. 11) shows the call rates for all samples, all samples minus CEPH controls, and CEPH controls using all SNPs (excluding
chromosome Y). Table 4 (p. 10) shows the number of samples dropped at various call rate exclusion thresholds. Similarly, Table 5
(p. 10) shows call rates for all non-CEPH samples, and Table 6 (p. 12) shows call rates for CEPH samples only. For example, using a
call rate of 95%, 5 samples (1%) will be dropped, and using a call rate of 98%, 6 samples (1.2%) will be dropped.


Table 4: Number of Samples Dropped by Call Rate Threshold (Y chromosome excluded) All Samples

cutoff   Ndrop    %Drop   Nkeep    %Keep
0.950        5    1.000     505   99.000
0.980        6    1.200     504   98.800
0.990        8    1.600     502   98.400
0.995       13    2.500     497   97.500
1.000      510  100.000       0    0.000

Table 5: Number of Samples Dropped by Call Rate Threshold (Y chromosome excluded) No CEPH

cutoff   Ndrop    %Drop   Nkeep    %Keep
0.950        5    1.000     489   99.000
0.980        6    1.200     488   98.800
0.990        8    1.600     486   98.400
0.995       13    2.600     481   97.400
1.000      494  100.000       0    0.000

Figure 4 (y-axis: Frequency)

Table 6: Number of Samples Dropped by Call Rate Threshold (Y chromosome excluded) CEPH Only

cutoff   Ndrop    %Drop   Nkeep    %Keep
0.950        0    0.000      16  100.000
0.980        0    0.000      16  100.000
0.990        0    0.000      16  100.000
0.995        0    0.000      16  100.000
1.000       16  100.000       0    0.000

3.2  Sample  Sex  Check 


In this section, information from Chromosomes X and Y is used to estimate sex. Subjects whose reported sex does not match the
estimated sex using SNP data are presented in Table 7 (p. 13), with all subjects displayed in Figure 5 (p. 14). Table 7 column
descriptions are shown below.


•  PEDSEX: Recorded sex for this sample (1=Male, 2=Female)

•  SNPSEX: Sex estimated from Chromosome X variants

•  STATUS:  Displays  “PROBLEM”  or  “OK”  for  each  individual 

•  F:  Plink  chromosome  X  inbreeding  (homozygosity)  estimate 

•  No.Ygeno:  Number  of  SNVs  on  Chromosome  Y 

•  cr.chry:  Chromosome  Y  call  rate 

•  No.Xgeno:  Number  of  SNVs  on  Chromosome  X 


The expectation is that F is more than 0.8 for Males and less than 0.2 for Females. We would expect cr.chry to be near 1 for Males
and near 0 for Females (given the pseudo-autosomal region of Chromosome Y).
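
The thresholds above can be turned into a simple classification rule. The sketch below is an assumption-laden illustration: the function names are hypothetical, and coding an ambiguous F as 0 is a choice made here, not something stated in the report.

```python
def snp_sex(f_estimate, male_min=0.8, female_max=0.2):
    """Estimate sex from the chromosome X inbreeding estimate F.

    Returns 1 (male), 2 (female), or 0 when F falls between the two
    thresholds and the call is ambiguous (an assumed convention).
    """
    if f_estimate > male_min:
        return 1
    if f_estimate < female_max:
        return 2
    return 0


def sex_status(pedsex, f_estimate):
    """Compare recorded sex (PEDSEX) with the SNP-based estimate."""
    return "OK" if snp_sex(f_estimate) == pedsex else "PROBLEM"


print(sex_status(1, 0.99))  # recorded male, F consistent with male
print(sex_status(1, 0.05))  # recorded male, F consistent with female
```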


Table 7: Sex Check

IID  FID  PEDSEX  SNPSEX  STATUS  F  No.Ygeno  cr.chry  het.chrx  No.Xgeno


3.3  Sample  Heterozygosity 

A histogram of the overall heterozygosity per sample is shown in Figure 6. We also analyzed the per-sample heterozygosity by
chromosome. In Figure 7 (p. 16), the horizontal dotted red line is the median heterozygosity for all samples.


4  Duplicate  Concordance 


Table 8: Duplicated Samples

              Number of             Mismatch   Mismatch   Missing
Sample        Replicates  Matched   (missing)  (called)   (all replicates)  Total SNPs  Concordance
QC1025302437           6  2356459       14102       150               1906     2372617      0.99994
QC1025302436           5  2356085       16002       184                346     2372617      0.99992
QC1025302407           5  2357152       13313       139               2013     2372617      0.99994

This study included 3 samples which were each run multiple times. In Table 8 (p. 13) we look at the number of SNPs whose
genotypes:

•  matched across all replicates,

•  did not match due to missingness in one or more replicates,

•  were called differently in the replicates, or

•  were missing for all replicates.
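
The report does not define the Concordance column. The sketch below tests one plausible reading, namely matched SNPs divided by matched plus called mismatches, against the counts in the duplicated samples table. The formula is an assumption that happens to reproduce the reported values to five decimals; the dictionary keys are hypothetical names.

```python
TOTAL_SNPS = 2372617

# Counts copied from the duplicated samples table.
replicates = {
    "QC1025302437": {"matched": 2356459, "mismatch_missing": 14102,
                     "mismatch_called": 150, "missing_all": 1906},
    "QC1025302436": {"matched": 2356085, "mismatch_missing": 16002,
                     "mismatch_called": 184, "missing_all": 346},
    "QC1025302407": {"matched": 2357152, "mismatch_missing": 13313,
                     "mismatch_called": 139, "missing_all": 2013},
}


def concordance(counts):
    """Assumed definition: agreement among SNPs called in all replicates."""
    called = counts["matched"] + counts["mismatch_called"]
    return counts["matched"] / called


for sample, counts in replicates.items():
    # The four categories should partition the full SNP set.
    assert sum(counts.values()) == TOTAL_SNPS
    print(sample, round(concordance(counts), 5))
```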


Figure 5: Sex assignment verification from Plink. Samples shown in red were flagged as errors by Plink. (x-axis: X Chromosome inbreeding estimate (homozygosity))

Figure 6 (axes: % Heterozygous SNPs, Number of Samples)

Figure 7: Sample Heterozygosity per Chromosome (axis: % Heterozygous SNPs)


Appendix  2 

SNP QC report after excluding problematic SNPs and problematic samples, including additional QC tests


EQTL  Test  Summary 

Inv: SN Thibodeau
Statistics Team: McDonnell, Kosel
Bioinformatics Team: Asmann, Middha, Hossain

Mayo  Clinic  College  of  Medicine,  Health  Sciences  Research 

Rochester  MN  USA 

September  14,  2013 


Contents

1  Introduction

2  Initial SNP Quality Control
   2.1  SNP Call Rates
   2.2  Failed, Monomorphic, and Low Call Rate SNPs by Chromosome
   2.3  Minor Allele Frequency
   2.4  Hardy Weinberg P-value

3  Initial Sample Quality Control
   3.1  Sample Call Rates
   3.2  Sample Sex Check
   3.3  Sample Heterozygosity

4  Batch Effects

5  PLINK Relationship Checking


1  Introduction

This document summarizes GWAS QC analysis performed on the HumanOmni2.5-4v1 chip for Prostate Cancer patients. Data are
available for 736 samples from 2,372,617 SNPs including 16 CEPH controls. This summary includes data for 510 samples and 2,366,208
SNPs including 16 controls.


2  Initial  SNP  Quality  Control 

2.1  SNP  Call  Rates 

We first look at how many SNPs drop out using different SNP call rate cutoffs. See Table 1 for the percentage of SNPs retained
as the call rate threshold increases. Using a call rate of 98%, 22,034 SNPs (0.9%) will be dropped. Using a call rate of 95%, 0 SNPs
(0%) will be dropped.

2.2  Failed,  Monomorphic,  and  Low  Call  Rate  SNPs  by  Chromosome 

This section describes how many SNPs failed completely, are “monomorphic”, or have a call rate < 95% by chromosome and overall
(Table 2, p. 8). First “failed” SNPs are identified, then “monomorphic” SNPs, and finally those SNPs with a call rate < 0.95. The
distribution of SNP call rates by chromosome is presented in Figure 1 (p. 4).

2.3  Minor  Allele  Frequency 

The distribution of minor allele frequencies (MAFs) for all SNPs is shown in Figure 2 (p. 5). There are a total of 454,736 (19.22%)
monomorphic SNPs and 807,572 (34.13%) SNPs with MAF < 1%.

2.4  Hardy  Weinberg  P-value 

This dataset does not include controls to reliably test for Hardy-Weinberg Equilibrium (HWE), so the following results should be
interpreted with caution. We include only Caucasian subjects, resulting in 494 independent subjects. Chromosomes X, Y, XY, and MT
markers are excluded from this summary, as are SNPs that failed on all samples and SNPs with MAF < 0.05. There are 902 SNPs with
an HWE p-value < 10e-05 (see Figure 3, p. 7).

Figure 1: SNP Call Rates by Chromosome

Table 1: SNP Call Rates

CallRate  NumSNPsBelow  %Below  NumSNPsAbove  %Above

Figure 3 (axes: Expected p-value, Observed p-value)


Table 2: SNP QC Summary by Chromosome - CEPH samples excluded

                        Failed         Monomorphic     Callrate<0.95      Remaining
Chrom    TotalSNPs      N      %         N       %        N      %          N       %
1           183728      0   0.00     37327   20.32        0   0.00     146401   79.68
2           193824      0   0.00     38984   20.11        0   0.00     154840   79.89
3           163427      0   0.00     31617   19.35        0   0.00     131810   80.65
4           152609      0   0.00     28952   18.97        0   0.00     123657   81.03
5           145233      0   0.00     29592   20.38        0   0.00     115641   79.62
6           154374      0   0.00     28606   18.53        0   0.00     125768   81.47
7           128819      0   0.00     24607   19.10        0   0.00     104212   80.90
8           125280      0   0.00     23353   18.64        0   0.00     101927   81.36
9           102842      0   0.00     19361   18.83        0   0.00      83481   81.17
10          119219      0   0.00     22806   19.13        0   0.00      96413   80.87
11          115865      0   0.00     23178   20.00        0   0.00      92687   80.00
12          112532      0   0.00     22314   19.83        0   0.00      90218   80.17
13           83353      0   0.00     14926   17.91        0   0.00      68427   82.09
14           76390      0   0.00     14557   19.06        0   0.00      61833   80.94
15           72174      0   0.00     13236   18.34        0   0.00      58938   81.66
16           76447      0   0.00     13527   17.69        0   0.00      62920   82.31
17           66220      0   0.00     12448   18.80        0   0.00      53772   81.20
18           68440      0   0.00     12179   17.80        0   0.00      56261   82.20
19           47589      0   0.00      8777   18.44        0   0.00      38812   81.56
20           56429      0   0.00     10088   17.88        0   0.00      46341   82.12
21           32030      0   0.00      5595   17.47        0   0.00      26435   82.53
22           33196      0   0.00      4987   15.02        0   0.00      28209   84.98
X            53137      0   0.00     11818   22.24        0   0.00      41319   77.76
Y             2386      0   0.00      1772   74.27        0   0.00        614   25.73
XY             416      0   0.00        49   11.78        0   0.00        367   88.22
MT             249      0   0.00        80   32.13        0   0.00        169   67.87
Overall    2366208      0   0.00    454736   19.22        0   0.00    1911472   80.78

Table 3: Minor Allele Frequency - CEPH samples and failed SNPs excluded

MAF cutoff     Ndrop    %Drop     Nkeep    %Keep
0.001         454736   19.200   1911472   80.800
0.010         807572   34.100   1558636   65.900
0.050        1092475   46.200   1273733   53.800
0.100        1318885   55.700   1047323   44.300

3  Initial Sample Quality Control


3.1  Sample  Call  Rates 

Figure 4 (p. 11) shows the call rates for all samples using all SNPs (excluding chromosome Y). Table 4 (p. 10) shows the number of
samples dropped at various call rate exclusion thresholds. Similarly, Table 5 (p. 10) shows call rates for all non-CEPH samples, and
Table 6 (p. 12) shows call rates for CEPH samples only. For example, using a call rate of 95%, 5 samples (1%) will be dropped, and
using a call rate of 98%, 6 samples (1.2%) will be dropped.


Table 4: Number of Samples Dropped by Call Rate Threshold (Y chromosome excluded) All Samples

cutoff   Ndrop    %Drop   Nkeep    %Keep
0.950        5    1.000     489   99.000
0.980        6    1.200     488   98.800
0.990        7    1.400     487   98.600
0.995       12    2.400     482   97.600
1.000      494  100.000       0    0.000

Table 5: Number of Samples Dropped by Call Rate Threshold (Y chromosome excluded) No CEPH

cutoff   Ndrop    %Drop   Nkeep    %Keep
0.950        5    1.000     489   99.000
0.980        6    1.200     488   98.800
0.990        7    1.400     487   98.600
0.995       12    2.400     482   97.600
1.000      494  100.000       0    0.000

Figure 4: Histogram o (y-axis: Frequency)

Table 6: Number of Samples Dropped by Call Rate Threshold (Y chromosome excluded) CEPH Only

cutoff  Ndrop  %Drop  Nkeep  %Keep
0.950   0   0
0.980   0   0
0.990   0   0
0.995   0   0
1.000   0   0

3.2  Sample  Sex  Check 


In this section, information from Chromosomes X and Y is used to estimate sex. Subjects whose reported sex does not match the
estimated sex using SNP data are presented in Table 7 (p. 13), with all subjects displayed in Figure 5 (p. 14). Table 7 column
descriptions are shown below.


•  PEDSEX: Recorded sex for this sample (1=Male, 2=Female)

•  SNPSEX: Sex estimated from Chromosome X variants

•  STATUS:  Displays  “PROBLEM”  or  “OK”  for  each  individual 

•  F:  Plink  chromosome  X  inbreeding  (homozygosity)  estimate 

•  No.Ygeno:  Number  of  SNVs  on  Chromosome  Y 

•  cr.chry:  Chromosome  Y  call  rate 

•  No.Xgeno:  Number  of  SNVs  on  Chromosome  X 


The expectation is that F is more than 0.8 for Males and less than 0.2 for Females. We would expect cr.chry to be near 1 for Males
and near 0 for Females (given the pseudo-autosomal region of Chromosome Y).


Table 7: Sex Check

IID  FID  PEDSEX  SNPSEX  STATUS  F  No.Ygeno  cr.chry  het.chrx  No.Xgeno


3.3  Sample  Heterozygosity 

A histogram of the overall heterozygosity per sample is shown in Figure 6. We also analyzed the per-sample heterozygosity by
chromosome. In Figure 7 (p. 16), the horizontal dotted red line is the median heterozygosity for all samples.

4  Batch  Effects 


Table  8:  Plate  Mapping 
WG0232831-DNA  1 

WG0232832-DNA  2 
WG0232833-DNA  3 
WG0232834-DNA  4 
WG0232835-DNA  5 
WG0232836-DNA  6 
WG0232837-DNA  7 
WG0232838-DNA  8 


Table 8 (p. 13) serves as a map for the following batch effect plots regarding Plate. To test for Plate effects in variant calling, we
performed a chi-squared test for each SNP comparing the allele frequency estimated using samples on one Plate to the allele frequency
estimated from the remaining Plates. We then took the mean of the chi-squared statistics for each Plate across all SNPs. The numbers
in the plot (Figure 8, p. 17) indicate the Plate. Figure 9 (p. 18) shows boxplots of the sample call rate for each Plate. The dashed
horizontal line is drawn at the 98th percentile of missingness rates for the SNPs used in the figure. Figure 10 (p. 19) shows boxplots of
the sample heterozygosity rate for each Plate. The dashed horizontal line is drawn at the median heterozygosity rate across samples.
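
The Plate test described above can be sketched as a 2x2 allele-count chi-squared statistic (one Plate versus all remaining Plates), averaged over SNPs. This is an illustrative reading of the description, not the report's code: the function names are hypothetical, the data are toy values, and SNPs with an empty margin are assigned a statistic of 0 by assumption.

```python
def allele_counts(calls):
    """(minor, major) allele counts from 0/1/2 genotype calls; None = missing."""
    called = [c for c in calls if c is not None]
    minor = sum(called)
    return minor, 2 * len(called) - minor


def chisq_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def mean_plate_chisq(genotypes, plates, plate):
    """Mean over SNPs of the chi-squared statistic for `plate` vs the rest."""
    stats = []
    for calls in genotypes:
        inside = [g for g, p in zip(calls, plates) if p == plate]
        outside = [g for g, p in zip(calls, plates) if p != plate]
        stats.append(chisq_2x2(*allele_counts(inside), *allele_counts(outside)))
    return sum(stats) / len(stats)


# Toy data: 2 SNPs (rows) x 4 samples, two samples per plate.
plates = [1, 1, 2, 2]
genotypes = [
    [0, 0, 0, 0],  # monomorphic: no plate difference is possible
    [2, 2, 0, 0],  # allele frequency differs completely between plates
]
print(mean_plate_chisq(genotypes, plates, 1))
```

A Plate whose mean statistic stands well apart from the others would be a candidate for a batch effect.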


Figure 5: Sex assignment verification from Plink. Samples shown in red were flagged as errors by Plink. (x-axis: X Chromosome inbreeding estimate (homozygosity))

Figure 6 (axis: % Heterozygous SNPs)

Figure 7: Sample Heterozygosity per Chromosome (axis: % Heterozygous SNPs)

Figure 8: Test for Batch Effects (axis: Mean Autosomal Call Rate)

Figure 9: Sample Call Rate by Plate (axis: Sample Missingness Rate)

Figure 10: Sample Heterozygosity by Plate (axes: Plate, Sample Heterozygosity)


5  PLINK Relationship Checking


This study consists of 494 presumed unrelated individuals. Relationship checking was performed by estimating the proportion of alleles
shared identical by descent (IBD) for all pairs of subjects. PLINK was used to estimate IBD. Independent SNPs were selected for
analysis by first excluding all SNPs with call rate < 0.95, MAF < 0.05, or HWE p-value < 1e-06. Remaining SNPs were pruned
using PLINK such that the pairwise correlation between SNPs (r2) is less than 0.01. A total of 21,395 SNPs were used for this analysis.
Figure 11 (p. 22) shows the IBD plot for all study samples. If this study includes both related and unrelated samples, then panel A
shows the unrelated samples and panel B shows related samples. Relationship codes shown in Figure 11, along with their expected IBD
sharing, are shown below.


CODE   RELATIONSHIP             E(IBD0)   E(IBD1)   E(IBD2)
PO     Parent-Offspring         0         1.00      0
FS     Full-Sibling             0.25      0.50      0.25
HS     Half-Sibling             0.50      0.50      0
AV     Avuncular                0.50      0.50      0
GPC    Grandparent-grandchild   0.50      0.50      0
FC     First-Cousin             0.75      0.25      0
HA     Half-Avuncular           0.75      0.25      0
HFC    Half-First-Cousin        0.875     0.125     0
HSFC   Half-Sib+First-Cousin    0.375     0.50      0.125
U      Unrelated                1.00      0         0

Table 9: Check for Cryptic relatedness: Unrelated pairs

FID1        IID1        FID2        IID2        Z0      Z1      Z2      PI_HAT  RT  Obs.RT
1213802311  1213802311  1211702138  1211702138  0.7714  0.2243  0.0044  0.1165  U   FC
1213802311  1213802311  1211001831  1211001831  0.7812  0.2188  0.0000  0.1094  U   FC
1213802218  1213802218  1211702092  1211702092  0.7671  0.2329  0.0000  0.1164  U   FC
1211800763  1211800763  1211702138  1211702138  0.7087  0.2105  0.0808  0.1861  U   Q
1211800763  1211800763  1213802245  1213802245  0.7112  0.2083  0.0805  0.1846  U   Q
1211800763  1211800763  1211001831  1211001831  0.7308  0.1815  0.0876  0.1784  U   Q
1211702138  1211702138  1213802245  1213802245  0.7294  0.1771  0.0935  0.1820  U   Q
1211702138  1211702138  1211001831  1211001831  0.6546  0.2433  0.1021  0.2237  U   Q
1211001818  1211001818  1211800765  1211800765  0.6586  0.2218  0.1196  0.2305  U   Q
1211001818  1211001818  1211702155  1211702155  0.6811  0.2368  0.0821  0.2005  U   Q
1211001818  1211001818  1213103091  1213103091  0.7526  0.2130  0.0345  0.1409  U   FC
1213802245  1213802245  1211001831  1211001831  0.7388  0.1948  0.0665  0.1639  U   Q
1211800765  1211800765  1211702155  1211702155  0.6784  0.2418  0.0799  0.2007  U   Q
1211800765  1211800765  1213103091  1213103091  0.7724  0.1935  0.0340  0.1308  U   FC
1211702155  1211702155  1213103091  1213103091  0.7657  0.1958  0.0385  0.1364  U   FC

All pairs of unrelated subjects with the probability of sharing 0 alleles IBD < 0.80 are shown in Table 9 (p. 21). There are 15
pairs of unrelated subjects who have higher than expected IBD sharing. Related pairs whose IBD sharing does not match expected
are shown in Table ?? (p. ??). All relative pairs where the absolute value of expected minus observed sharing is greater than 0.25 for
any of the IBD sharing probabilities are included. These tables include both the expected relationship type (column labelled 'RT') and
the observed relationship type based on estimated IBD probabilities (column labelled 'Obs.RT'). There are 0 pairs of related subjects
whose relationships appear to be different than expected. Relationship codes shown in these tables are described on page 20.

Figure 11: Estimated IBD sharing between all pairs of subjects. If study includes pedigrees, then the IBD sharing is split into two
panels: Panel A includes all unrelated pairs of subjects and Panel B includes all related pairs within pedigrees. Each relationship is
displayed in a different symbol and color. Relationship codes are described on page 20.
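
For reference, PLINK's PI_HAT is the estimated proportion of alleles shared IBD, P(IBD=2) + 0.5 x P(IBD=1). The sketch below recomputes it for three pairs copied from the cryptic relatedness table and applies the Z0 < 0.80 flagging rule from the text. The function names are hypothetical, and small differences from the reported values are expected because the tabulated Z values are rounded.

```python
# (Z0, Z1, Z2, reported PI_HAT) for three pairs from the cryptic relatedness table.
table_rows = [
    (0.7714, 0.2243, 0.0044, 0.1165),
    (0.7812, 0.2188, 0.0000, 0.1094),
    (0.6546, 0.2433, 0.1021, 0.2237),
]


def pi_hat(z1, z2):
    """Proportion of alleles shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def higher_than_expected(z0, cutoff=0.80):
    """Flag a presumed-unrelated pair whose P(IBD=0) is below the cutoff."""
    return z0 < cutoff


for z0, z1, z2, reported in table_rows:
    print(f"computed {pi_hat(z1, z2):.4f}  reported {reported:.4f}  "
          f"flagged {higher_than_expected(z0)}")
```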
1  Introduction


This document describes the mRNA-seq quality control checks and initial analysis performed
for the “Thibodeau eQTL mRNA NGS QC” project. A total of 493 subjects contributed 493
samples, consisting of N=19 cystoprostatectomy samples and N=474 low Gleason samples. Each
of the 493 subjects gave 1 sample, and there are 0 repeated samples. Samples were run up to 5 per
lane, with the groupings listed in Table 1.

There were 23,398 Genes presented in the original data (46 Genes mapped to 2 different
chromosomes and 3 Genes mapped to 3 different chromosomes). Of all the genes, 780 (3.3%)
had no counts for all samples and were removed from further analysis (genes deemed
undetectable/noise). The remaining genes were distributed across all the chromosomes (Table
2). For genes that mapped to both chromosome X and Y, only the chromosome X version
was retained. After filtering, there were only 3 genes (FAM45B, MIR1256, TTL) mapped to
more than 1 location (chr10, chrX, chr10, chrX, chr13, chr2). Additionally, there were still
37 Genes that mapped to chromosome Y (AMELY, BCORP1, CD24, CSPG4P1Y, DDX3Y,
EIF1AY, GYG2P1, KDM5D, LINC00230A, NCRNA00185, NLGN4Y, PCDH11Y, PRKY,
RBMY1A3P, RBMY2EP, RBMY2FP, RPS4Y1, RPS4Y2, SRY, TBL1Y, TMSB4Y, TSPY1,
TSPY2, TTTY10, TTTY12, TTTY13, TTTY14, TTTY15, TTTY16, TTTY18, TTTY19,
TTTY22, TTTY5, TXLNG2P, USP9Y, UTY, ZFY).
Flowcell  Run.Name  N  (Subjects listed beneath each run)

1   121112_SN7001166_0111_BD1KD4ACXX   21
    s_10,s_114,s_142,s_202,s_21,s_23,s_280,s_313
    s_341,s_344,s_360,s_378,s_435,s_449,s_452,s_459
    s_471,s_501,s_511,s_547,s_61
2   121112_SN7001166_0111_BD1KD4ACXX_2   2
    s_549,s_87
3   121116_SN725_0269_BD1KC5ACXX   22
    s_104,s_141,s_172,s_176,s_224,s_354,s_375,s_392
    s_398,s_405,s_410,s_414,s_42,s_432,s_450,s_453
    s_504,s_506,s_516,s_539,s_65,s_80
4   121120_SN414_0250_AC1F36ACXX   18
    s_11,s_110,s_12,s_173,s_196,s_238,s_35,s_394
    s_404,s_422,s_423,s_438,s_444,s_451,s_472,s_479
    s_532,s_536
5   121120_SN414_0251_BD1KDGACXX   24
    s_106,s_160,s_165,s_169,s_217,s_218,s_239,s_24
    s_246,s_249,s_258,s_301,s_339,s_355,s_36,s_370
    s_400,s_419,s_443,s_478,s_486,s_497,s_510,s_527
6   121128_SN7001166_0114_AD1K24ACXX   28
    s_133,s_163,s_166,s_187,s_198,s_226,s_27,s_270
    s_274,s_276,s_286,s_304,s_307,s_314,s_324,s_383
    s_41,s_437,s_474,s_492,s_509,s_541,s_546,s_77
    s_9,s_95,s_96,s_98
7   121129_SN616_0231_AC1GC0ACXX   23
    s_126,s_145,s_155,s_182,s_194,s_260,s_272,s_275
    s_279,s_285,s_288,s_321,s_34,s_372,s_441,s_446
    s_447,s_477,s_483,s_507,s_553,s_556,s_70
8   121129_SN616_0232_BD1K1UACXX   8
    s_167,s_241,s_338,s_365,s_476,s_498,s_62,s_86
9   121130_SN414_0256_AD1M44ACXX   24
    s_1,s_119,s_153,s_156,s_157,s_266,s_268,s_31
    s_343,s_348,s_367,s_4,s_402,s_408,s_465,s_484
    s_519,s_525,s_551,s_558,s_60,s_76,s_78,s_82
10  121205_SN725_0272_AC1H54ACXX   26
    s_105,s_118,s_137,s_140,s_147,s_168,s_181,s_183
    s_191,s_2,s_232,s_264,s_294,s_333,s_352,s_387
    s_388,s_393,s_417,s_448,s_488,s_49,s_496,s_50
    s_512,s_562
11  121205_SN725_0273_BD1M9VACXX   23
    s_152,s_171,s_178,s_210,s_25,s_269,s_287,s_337
    s_347,s_366,s_377,s_440,s_467,s_482,s_490,s_534
    s_538,s_542,s_59,s_84,s_89,s_91,s_94
12  121213_SN725_0275_BC1GGBACXX   30
    s_100,s_101,s_109,s_113,s_121,s_125,s_13,s_134
    s_144,s_17,s_185,s_195,s_243,s_326,s_340,s_380
    s_409,s_413,s_43,s_458,s_466,s_475,s_480,s_5
    s_505,s_522,s_530,s_79,s_81,s_97
13  121214_SN7001166_0118_AD1LW9ACXX   22
    s_131,s_15,s_158,s_177,s_19,s_193,s_253,s_259
    s_319,s_32,s_33,s_373,s_382,s_397,s_407,s_421
    s_425,s_461,s_513,s_550,s_7,s_75
14  121214_SN7001166_0119_BD1M77ACXX   15
    s_123,s_129,s_235,s_282,s_316,s_346,s_357,s_386
    s_390,s_395,s_468,s_52,s_535,s_555,s_63
15  121218_SN616_0237_AD1M5BACXX   27
    s_115,s_116,s_151,s_18,s_180,s_205,s_255,s_257
    s_290,s_293,s_317,s_318,s_359,s_368,s_412,s_415
    s_427,s_442,s_45,s_469,s_47,s_515,s_526,s_548
    s_56,s_68,s_85
16  130104_SN7001166_0126_AC1MU4ACXX   33
    s_111,s_135,s_149,s_174,s_209,s_215,s_221,s_229
    s_278,s_30,s_308,s_310,s_315,s_363,s_364,s_385
    s_396,s_406,s_481,s_489,s_491,s_493,s_495,s_514
    s_518,s_528,s_537,s_543,s_545,s_57,s_64,s_69
    s_92
17  130104_SN7001166_0127_BC1N0KACXX   20
    s_102,s_112,s_122,s_124,s_132,s_138,s_143,s_199
    s_22,s_234,s_320,s_327,s_329,s_369,s_381,s_39
    s_403,s_416,s_44,s_46
18  130104_SN7001166_0127_BC1N0KACXX_2   1
    s_533
19  130111_SN7001166_0128_AD1NCWACXX   8
    s_161,s_291,s_349,s_433,s_434,s_456,s_503,s_53
20  130125_SN316_0280_BC1KPWACXX   15
    s_148,s_162,s_170,s_201,s_216,s_263,s_38,s_384
    s_40,s_430,s_485,s_6,s_72,s_74,s_93
21  MERGE_3_28_2013-1   21
    s_108,s_117,s_127,s_128,s_136,s_16,s_184,s_186
    s_188,s_189,s_203,s_206,s_212,s_213,s_227,s_233
    s_247,s_254,s_261,s_265,s_267
22  MERGE_3_28_2013-2   33
    s_28,s_281,s_306,s_311,s_312,s_323,s_325,s_328
    s_330,s_336,s_345,s_350,s_351,s_361,s_362,s_374
    s_376,s_391,s_401,s_424,s_426,s_428,s_439,s_460
    s_464,s_499,s_517,s_554,s_565,s_71,s_8,s_83
    s_99
23  MERGE_3_28_2013-3   10
    s_120,s_150,s_164,s_190,s_192,s_197,s_200,s_208
    s_214,s_228
24  MERGE_3_28_2013-4   38
    s_231,s_237,s_242,s_245,s_248,s_250,s_252,s_256
    s_26,s_271,s_273,s_277,s_283,s_289,s_295,s_297
    s_298,s_322,s_332,s_342,s_358,s_389,s_411,s_418
    s_420,s_431,s_445,s_455,s_457,s_463,s_470,s_473
    s_523,s_524,s_55,s_557,s_58,s_88
25  MERGE_3_28_2013-5   1
    s_3

Table 1: Samples in each Flowcell


chr01  chr02  chr03  chr04  chr05  chr06  chr07  chr08  chr09
 2279   1447   1226    854    993   1184   1086    780    925

chr10  chr11  chr12  chr13  chr14  chr15  chr16  chr17  chr18
  881   1405   1152    385    759    770    916   1326    319

chr19  chr20  chr21  chr22   chrX   chrY
 1535    644    286    528    881     37

Table 2: Chromosome distribution of Genes


Summaries of the log2(counts) and %counts > 0 by subject, by flowcell, by group, by
%GC content, and by gene size (counting only the sum of the exons) are included in the
following sections. These factors can influence the number of counts observed.


2  Assessing  log2  (Gene  Counts) 


2.1  By  Subject  and  Lane 


Figure 1 shows the distribution of Gene Counts separately for each subject via boxplots.
The plots are color-coded to indicate tumor type. Because the values are presented on a log2
scale, the Gene Counts is actually the Gene Counts + 1 so that those genes with a count of
zero are also included in the figure. Figures 2 and 3 to 27 show the same subjects, but
this time the boxes are color-coded by RunID. The hope is that the boxplots are relatively
consistent across all the subjects. Figures 28 to 52 show the distribution of gene counts via
line graph. Figure 53 shows, for each subject, the sum of all the Gene Counts. Lines are
used to separate subjects by RunID. The red line in the middle of the dots is the median of
each RunID.
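
The log2(Gene Counts + 1) transform and the per-RunID median of total counts described above can be sketched as follows. The subject IDs, run labels, and counts are invented toy values, and the function names are hypothetical.

```python
import math
from statistics import median

# Toy raw gene counts per subject (invented values, not study data).
counts = {
    "s_a": [0, 7, 15, 1023],
    "s_b": [3, 0, 31, 511],
    "s_c": [1, 1, 63, 255],
}
run_of = {"s_a": "run_1", "s_b": "run_1", "s_c": "run_2"}


def log2_count(count):
    """log2(count + 1), so genes with a count of zero map to 0."""
    return math.log2(count + 1)


def total_counts_by_run(counts, run_of):
    """Median across subjects of the summed gene counts, per run."""
    totals = {}
    for subject, values in counts.items():
        totals.setdefault(run_of[subject], []).append(sum(values))
    return {run: median(sums) for run, sums in totals.items()}


print([log2_count(c) for c in counts["s_a"]])
print(total_counts_by_run(counts, run_of))
```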
Figure 1: Distribution of log2(Gene Counts) for each Subject color-coded by Group (y-axis: log2(Gene Counts + 1))

Figure 2: Distribution of log2(Gene Counts) for each Subject color-coded by RunID (y-axis: log2(Gene Counts + 1))

Figure 3: Distribution of log2(Total Gene Counts) for each Subject by RunID

Figure 4: Distribution of log2(Total Gene Counts) for each Subject by RunID (y-axis: log2(Gene Counts + 1))

Figure 28: Distribution of log2(Total Gene Counts) for each Subject by RunID (panel: 121112 SN7001166 0111 BD1KD4ACXX)

Figure 52: Distribution of log2(Total Gene Counts) for each Subject by RunID


2.2  By  GC  Content 


Because GC Content is known to impact expression levels and can be impacted by PCR, it
is important to evaluate whether there are individual subjects that show overall Gene Count
levels that vary by %GC. Figure 54 shows a smoothed color density representation of the
scatterplot with %GC on the x-axis and log2(Gene Count) on the y-axis. A loess smoother
line is shown indicating the general pattern of all the Gene Count values for this particular
subject. Similarly, Figures 55 to 79 show the loess smoother line for each subject. Based on
these plots, it appears that the overall pattern is similar for all samples. Figure 80 shows the
distribution of log2(Gene Count+1) by deciles of %GC by flowcell. Again, there is clearly a
lower Gene Count when the %GC is higher, but the patterns are similar for most samples.
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Figure  53:  Distribution  of  Total  Gene  Counts)  for  each  Subject  by  RunID 
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log2(Gene  Count  + 1) 
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121112  SN7001166  0111  BD1KD4ACXX 


%GC 


Figure  55:  Distribution  of  Percent  GC  versus  log2(Gene  Count  +  1)  with  a  loess  smoother  for  each  subject  by  flowcell 
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Figure  56:  Distribution  of  Percent  GC  versus  log2(Gene  Count  +  1)  with  a  loess  smoother  for  each  subject  by  flowcell 
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Figure  57:  Distribution  of  Percent  GC  versus  log2(Gene  Count  +  1)  with  a  loess  smoother  for  each  subject  by  flowcell 
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Figure  80:  Distribution  of  log2(Gene  Count+1)  by  deciles  of  %GC  and  flowccll 
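The decile summary described above can be sketched in a few lines of Python. This is an illustration only, not the code used for the report: the function name and its arguments are hypothetical, and it assumes one subject's raw gene counts and the matching per-gene %GC values are available as arrays.

```python
import numpy as np

def median_log_count_by_gc_decile(gc_percent, counts):
    """Median log2(count + 1) within each decile of %GC for one subject.

    gc_percent and counts are per-gene arrays of equal length (hypothetical
    inputs; the report does not describe its data layout).
    """
    gc = np.asarray(gc_percent, dtype=float)
    log_counts = np.log2(np.asarray(counts, dtype=float) + 1.0)
    # Decile boundaries of %GC: the 10th, 20th, ..., 90th percentiles.
    edges = np.percentile(gc, np.arange(10, 100, 10))
    # Bin index 0..9 for every gene.
    decile = np.searchsorted(edges, gc, side="right")
    # Note: with heavily tied %GC values a decile could be empty (result NaN).
    return np.array([np.median(log_counts[decile == d]) for d in range(10)])
```

A downward trend in the returned values from the first to the last decile would correspond to the lower Gene Count at higher %GC noted in the text.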


2.3  By  Gene  Size 

Gene Size is known to impact expression levels and hence it is important to assess overall
Gene Count levels by Gene size. Figure 81 shows boxplots of Gene Counts by quintiles of
Gene size, Figure 82 shows boxplots of Gene Counts by quintiles of Gene size and flowcell,
and Figure 83 shows the distribution of log2(Gene Count+1) with smoothed lines for each
subject. Patterns differ by size but there are no extreme outliers.

Figure 81: Distribution of log2(Gene Count+1) by Gene Size (5 groups)

Figure 82: Distribution of log2(Gene Count+1) by flowcell and Gene Size (5 groups) color-coded by flowcell

Figure 83: Distribution of log2(Gene Count+1) by Gene Size. Lowess smoothed lines are shown for each subject


2.4  Individual  Gene  Counts  versus  the  average  Gene  Count 


Finally, it is useful to look at how individual Gene Counts differ from the average (Figure
84).

Figure 84: MA Plot showing the difference of log2(Gene Count+1) - mean(log2(Gene Count+1)) versus mean(log2(Gene
Count+1)). Lowess smoothed lines are shown for each subject and color-coded by flowcell
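The two quantities plotted in Figure 84 can be computed as in the following sketch. It assumes a genes-by-subjects matrix of raw counts and takes the mean across subjects for each gene; the function name and the matrix layout are assumptions made for illustration.

```python
import numpy as np

def ma_values(counts):
    """Return (M, A) for a genes x subjects matrix of raw gene counts.

    A[g]    = mean over subjects of log2(count + 1) for gene g
    M[g, s] = log2(count[g, s] + 1) - A[g]
    """
    log_counts = np.log2(np.asarray(counts, dtype=float) + 1.0)
    a = log_counts.mean(axis=1)
    m = log_counts - a[:, None]
    return m, a
```

Plotting each column of M against A, with a smoothed line per subject, gives one curve per subject as described in the caption.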


3  Normalizing  Data 


In  much  of  the  literature  RPKM  (reads  per  kilobase  per  million)  has  been  used  to  normalize 
the  mRNA-seq  count  data.  The  objective  is  to  take  into  account  the  fact  that  some  runs, 
because  of  the  application  step,  are  going  to  produce  higher  counts.  Additionally,  this 
approach  takes  into  account  the  fact  that  some  genes  are  larger  than  others  and  therefore  will 
have  larger  counts.  Count  data  typically  is  analyzed  assuming  either  a  Poisson  or  Negative 
Binomial  distribution.  Unfortunately,  RPKM  changes  the  underlying  structure  of  the  data 
and  renders  the  distributional  assumptions  invalid  when  directly  adjusting  the  ratio.  The 
preferred  approach  is  to  model  the  original  gene  counts  and  adjust  for  additional  factors  by 
means  of  an  offset  in  a  Negative  Binomial  model. 
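
As a sketch of what such an offset model can look like (the notation is introduced here for illustration and is not taken from the report), let $Y_{gi}$ be the raw count for gene $g$ in subject $i$, $x_i$ the covariates for subject $i$, $N_i$ the total mapped reads for that subject, and $L_g$ the exon length of the gene:

$$
Y_{gi} \sim \mathrm{NB}(\mu_{gi}, \phi_g), \qquad
\log \mu_{gi} = o_{gi} + x_i^{\top} \beta_g, \qquad
o_{gi} = \log\!\left(\frac{N_i \, L_g}{10^{9}}\right)
$$

With this choice of offset, $\exp(x_i^{\top} \beta_g)$ is on the RPKM scale, while the likelihood is still evaluated on the untransformed counts, so the Negative Binomial assumption is not disturbed.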

The  RPKM  for  a  given  sample  (subject)  is  as  follows: 

C  =  Number  of  reads  mapped/assigned  to  a  gene  for  that  sample 
L  =  exon  length  in  base-pairs  for  a  gene 
N  =  Total  mapped  reads  for  the  sample 

These are combined in the equation for RPKM:

$$\mathrm{RPKM} = \frac{10^{9} \cdot C}{N \cdot L}$$
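
The definition above translates directly into code. The function below is a minimal sketch; its name and argument names are chosen here for illustration and do not come from the report.

```python
import numpy as np

def rpkm(counts, exon_lengths_bp, total_mapped_reads):
    """RPKM for one sample: 10^9 * C / (N * L).

    counts             -- C, reads assigned to each gene
    exon_lengths_bp    -- L, exon length of each gene in base pairs
    total_mapped_reads -- N, total mapped reads for the sample
    """
    c = np.asarray(counts, dtype=float)
    length = np.asarray(exon_lengths_bp, dtype=float)
    return 1e9 * c / (float(total_mapped_reads) * length)
```

For example, a gene with 500 assigned reads and 2,000 bp of exon in a sample with 20 million mapped reads has an RPKM of 12.5.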


3.1  CQN  normalization 


Recent  publications  have  shown  that  %GC  content  can  have  a  large  impact  on  Gene  Counts 
and  may  need  to  be  accounted  for  in  the  analysis.  The  CQN  approach  uses  the  %GC  Content 
in  addition  to  total  mapped  reads  and  Gene  Length  to  create  an  appropriate  offset  variable 
for  each  subject-gene  combination. 

The CQN package in R was used to estimate an offset for each subject and gene combination,
taking into account exon length (gene size) for each gene, %GC content, and total
mapped reads for each subject. This offset was then used in the edgeR package in R to run
the analysis testing for group differences. Figures 85, 86, 87, and 88 show QC plots after
normalization (per subject, by GC Content, by Gene size, and Mean vs Average).


3.2  Sample  Filters 

A total of 493 samples passed the sample QC filters. No samples failed the QC filters, so
none were removed from further analysis. Table 3, which lists excluded samples and the
reason for exclusion, is therefore empty.

SampleID | Use.Status | Exclude.Reason
Table 3: List of Excluded Samples

Figure 85: Distribution of normalized Gene Counts/million (on log2 scale) for each subject.

Figure 86: Distribution of normalized Gene Count (on log2 scale) by GC Content. Lowess
smoothed lines are shown for each subject

Figure 87: Distribution of normalized Gene Count (on log2 scale) by Gene Size. Lowess
smoothed lines are shown for each subject

Figure 88: MA Plot showing the difference of the normalized Gene Count - mean(normalized
Gene Count) versus mean(normalized Gene Count). Lowess smoothed lines are shown for
each subject.

Figure 89: Distribution of log2(Gene Counts + 1) for each Subject by filtering


3.3  Gene  Filters 

Of the remaining genes with at least 1 count, 5,225 (23.1%) had a median count of less
than 16 in the analysis groups and were removed from further analysis (genes deemed
undetectable/noise). This filter was applied to the raw count data. The normalization
will not be run again; the filtered-out genes are simply removed prior to analysis. Figure
89 shows the distribution of log2(Gene Count + 1) for each subject before and after
filtering for low gene count.
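
A filter of this kind can be sketched as follows. The code assumes a genes-by-subjects matrix of raw counts and takes the median across all subjects; the report says only "in the analysis groups", so how the median was computed per group is not known, and the function name is hypothetical.

```python
import numpy as np

def low_count_filter(counts, min_median=16):
    """Boolean mask of genes to keep, from a genes x subjects raw count matrix.

    A gene is kept if it has at least 1 count overall and its median raw
    count across subjects is at least min_median (assumption: one pooled
    median rather than a per-group median).
    """
    counts = np.asarray(counts)
    detected = counts.sum(axis=1) >= 1
    return detected & (np.median(counts, axis=1) >= min_median)
```

Because the mask is computed on raw counts, it can be applied to the already normalized data without rerunning the normalization, which matches the procedure described above.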
Appendix 4: eQTL analysis for Chromosome 5 region of interest

[Figure: eQTL results for region chr5.1795829.1995829.220. NSNP = 12; Ngene = 16; Bonferroni P-value = 1.96e-07.]

Panels recovered from the figure:

CIS (chr5.1795829.1995829.220.CIS) and TRANS (chr5.1795829.1995829.220.TRANS): observed p-value versus expected p-value.

Regional plot: -log10(P) versus position (Mb) for chr5.1795829.1995829.220. Genes shown: IRX4, TRIP13, BRD9, CLPTM1L, LPCAT1, C5orf38, MRPL36, LOC728613, IRX2, SDHAP3, SLC12A7, NDUFS6, ZDHHC11, LOC100506688, NKD2, SLC6A19.

Normalized expression by genotype (x-axis 0.0 to 2.0; group sizes N printed under each panel):

rs12655062_A : IRX4, -log10(P) = 62.4
rs4975758_G : IRX4, -log10(P) = 51.58
rs10866528_G : IRX4, -log10(P) = 50.94
rs10866527_T : IRX4, -log10(P) = 50.42
rs4975759_G : IRX4, -log10(P) = 48.33
rs34695572_A : IRX4, -log10(P) = 47.95
rs13177232_G : IRX4, -log10(P) = 44.82
rs35375589_A : IRX4, -log10(P) = 44.13
rs13177600_A : IRX4, -log10(P) = 43.71
rs35010507_G : IRX4, -log10(P) = 43.48


Appendix 5: eQTL analysis for Chromosome 19 region of interest

CIS: chr19.38635613.38835613.112.CIS  TRANS: chr19.38635613.38835613.112.TRANS
Table 1: List of PC risk-SNPs used for the study, including Chromosome location

SNP ID | Chr. | GRCh37/hg19 Position | Alleles (risk) | Intra/intergenic | Gene | LD region | Reference
--- | --- | --- | --- | --- | --- | --- | ---
rs636291 | 1p35 | 10,556,097 | A/G |  |  | chr1.10456097.10656097.112 | unpublished meta analysis
rs17599629 | 1q21 | 150,658,287 | G/A | intronic | GOLPH3L | chr1.150558287.150758287.49 | unpublished meta analysis
rs1218582 | 1q21.3 | 154,834,183 | A/G | Intragenic-intron | KCNN3 | chr1.154734183.154934183.172 | Eeles 2013
rs4245739 | 1q32.1 | 204,518,842 | A/C | Intragenic-intron | MDM4 | chr1.204418842.204618842.68 | Eeles 2013
rs1775148 | 1q32 | 205,757,824 | C/T |  |  | chr1.205657824.205857824.106 | unpublished meta analysis
rs11902236 | 2p25.1 | 10,117,868 | G/A | Intragenic-intron | GRHL1 | chr2.10017868.10217868.182 | Eeles 2013
rs9287719 | 2p25 | 10,710,730 | C/T | 161 bp 3' | NOL10 | chr2.10610730.10810730.147 | unpublished meta analysis
rs13385191 | 2p24.1 | 20,888,265 | A/G | Intragenic-intron | C2orf43 | chr2.20788265.20988265.160 | Takata 2010
rs1465618 | 2p21 | 43,553,949 | A/G | Intragenic-intron | THADA | chr2.43453949.43653949.82 | Eeles 2009
rs721048 | 2p15 | 63,131,731 | A/G | Intragenic-intron | EHBP1 | chr2.63031731.63401164.88 | Gudmundsson 2008
rs6545977 | 2p15 | 63,301,164 | A/G | Intergenic |  | chr2.63031731.63401164.88 | Eeles 2009
rs2028898 | 2p11.2 | 85,777,270 | C/T | Intragenic-intron | GGCX | chr2.85677270.85894297.100 | Akamatsu 2012
rs10187424 | 2p11.2 | 85,794,297 | A/C | Intergenic |  | chr2.85677270.85894297.100 | Kote-Jarai 2011
rs12621278 | 2q31.1 | 173,311,553 | A/G | Intragenic-intron | ITGA6 | chr2.173211553.173411553.175 | Eeles 2009
rs7584330 | 2q37.3 | 238,387,228 | A/G/C | Intergenic |  | chr2.238287228.238543226.217 | Kote-Jarai 2011
rs2292884 | 2q37.3 | 238,443,226 | A/G | intragenic-intron | MLPH | chr2.238287228.238543226.217 | Schumacher 2011
rs3771570 | 2q37.3 | 242,382,864 | G/A | Intragenic-intron | FARP2 | chr2.242282864.242482864.97 | Eeles 2013
rs9311171 | 3p22.2 | 37,996,477 | G/T | Intragenic-intron | CTDSPL | chr3.37896477.38096477.85 | Murabito 2007
rs2660753 | 3p12.1 | 87,110,674 | C/T | Intergenic |  | chr3.87010674.87341497.136 | Eeles 2008
rs9284813 | 3p12.1 | 87,152,169 | A/G | Intergenic |  | chr3.87010674.87341497.136 | Takata 2010
rs17181170 | 3p12.1 | 87,173,324 | A/G | Intergenic |  | chr3.87010674.87341497.136 | Eeles 2009
rs7629490 | 3p11.2 | 87,241,497 | C/T | Intergenic |  | chr3.87010674.87341497.136 | Schumacher 2011
rs2055109 | 3p11.2 | 87,467,332 | C/T | Intergenic |  | chr3.87367332.87567332.101 | Akamatsu 2012
rs7611694 | 3q13.2 | 113,275,624 | A/C | Intragenic-intron | SIDT1 | chr3.113175624.113375624.105 | Eeles 2013
rs10934853 | 3q21.3 | 128,038,373 | A/C | Intragenic-intron | EEFSEC | chr3.127938373.128138373.49 | Gudmundsson 2009
rs6763931 | 3q23 | 141,102,833 | G/A/T | intragenic-intron | ZBTB38 | chr3.141002833.141202833.103 | Kote-Jarai 2011
rs345013 | 3q24 | 145,173,788 | C/T | Intergenic |  | chr3.145073788.145273788.65 | Murabito 2007
rs10936632 | 3q26.2 | 170,130,102 | A/C | Intergenic |  | chr3.170030102.170230102.97 | Kote-Jarai 2011
rs10009409 | 4q13 | 73,855,253 | T/C | 65 kb 3' | COX18 | chr4.73755253.73955253.78 | unpublished meta analysis
rs1894292 | 4q13.3 | 74,349,158 | G/A | Intragenic-intron | AFM | chr4.74249158.74449158.75 | Eeles 2013
rs12500426 | 4q22.3 | 95,514,609 | A/C | Intragenic-intron | PDLIM5 | chr4.95414609.95662877.135 | Eeles 2009
rs17021918 | 4q22.3 | 95,562,877 | C/T | Intragenic-intron | PDLIM5 | chr4.95414609.95662877.135 | Eeles 2009
rs7679673 | 4q24 | 106,061,534 | A/C | Intergenic |  | chr4.105961534.106161534.83 | Eeles 2009
rs2242652 | 5p15.33 | 1,280,028 | G/T | intragenic-intron | TERT | chr5.1180028.1398733.193 | Kote-Jarai 2011, Nam 2012
rs7725218 | 5p15.33 | 1,282,414 | A/G | intragenic-intron | TERT | chr5.1180028.1398733.193 | Kote-Jarai 2013
rs2853676 | 5p15.33 | 1,288,547 | A/G | intragenic-intron | TERT | chr5.1180028.1398733.193 | Kote-Jarai 2013
rs13190087 | 5p15.33 | 1,298,733 | A/C | intergenic |  | chr5.1180028.1398733.193 | Kote-Jarai 2013
rs12653946 | 5p15.33 | 1,895,829 | C/T | Intergenic |  | chr5.1795829.1995829.220 | Takata 2010, Cheng 2012
rs2121875 | 5p12 | 44,365,545 | G/T | intragenic-intron | FGF10 | chr5.44265545.44465545.72 | Kote-Jarai 2011
rs4466137 | 5q14.3 | 82,985,739 | G/T | Intragenic-intron | HAPLN1 | chr5.82885739.83085739.124 | Murabito 2007
rs37181 | 5q23.1 | 115,630,004 | A/C | intergenic |  | chr5.115530004.115730004.117 | Kote-Jarai 2011
rs6869841 | 5q35.2 | 172,939,426 | G/A | Intergenic |  | chr5.172839426.173039426.176 | Eeles 2013
rs4713266 | 6p24 | 11,219,030 | C/T | intronic | NEDD9 | chr6.11119030.11319030.200 | unpublished meta analysis
rs115457135 (rs7767188) | 6p22 | 30,073,776 | A/G | intronic | TRIM31 | chr6.29973776.30173776.600 | unpublished meta analysis
rs130067 | 6p21.33 | 31,118,511 | A/G | missense | CCHCR1 | chr6.31018511.31218511.772 | Kote-Jarai 2011
rs3096702 (rs114376585) | 6p21.32 | 32,192,331 | G/A | Intergenic |  | chr6.32092331.32292331.467 | Eeles 2013
rs115306967 | 6p21 | 32,400,939 | G/C | 5 kb 5' | HLA-DRB6 | chr6.32300939.32500939.715 | unpublished meta analysis
rs1983891 | 6p21.1 | 41,536,427 | C/T | Intragenic-intron | FOXP4 | chr6.41436427.41636427.190 | Takata 2010
rs10498792 | 6p12.3 | 51,666,631 | C/T | Intragenic-intron | PKHD1 | chr6.51566631.51766631.91 | Murabito 2007
rs9443189 | 6q14 | 76,495,882 | G/A |  |  | chr6.76395882.76595882.52 | unpublished meta analysis
rs2273669 | 6q21 | 109,285,189 | A/G | Intragenic-intron | ARMC2 | chr6.109185189.109385189.70 | Eeles 2013
rs339331 | 6q22.1 | 117,210,052 | C/T | Intragenic-intron | RFX6 | chr6.117110052.117310052.82 | Takata 2010
rs1933488 | 6q25.2 | 153,441,079 | A/G | Intragenic-intron | RGS17 | chr6.153341079.153541079.117 | Eeles 2013
rs651164 | 6q25.3 | 160,581,374 | A/G | Intergenic |  | chr6.160481374.160681374.169 | Eeles 2009, Schumacher 2011
rs9364554 | 6q25.3 | 160,833,664 | C/T | Intragenic-intron | SLC22A3 | chr6.160733664.160933664.151 | Eeles 2008
rs12155172 | 7p15.3 | 20,994,491 | A/G | Intergenic |  | chr7.20894491.21094491.113 | Eeles 2009, 2013
rs10486567 | 7p15.2 | 27,976,563 | A/G | Intragenic-intron | JAZF1 | chr7.27876563.28076563.143 | Thomas 2008, Zheng 2009
rs56232506 | 7p12 | 47,437,244 | A/G | intronic | TNS3 | chr7.47337244.47537244.142 | unpublished meta analysis
rs6465657 | 7q21.3 | 97,816,327 | C/T | Intragenic-intron | LMTK2 | chr7.97716327.97916327.72 | Eeles 2008, 2009
rs2928679 | 8p21.2 | 23,438,975 | C/T | Intergenic |  | chr8.23338975.23626463.307 | none?
rs1512268 | 8p21.2 | 23,526,463 | A/G/T | Intergenic |  | chr8.23338975.23626463.307 | Eeles 2009, Takata 2010, Cheng 2012
rs11135910 | 8p21.2 | 25,892,142 | G/A | Intragenic-intron | EBF2 | chr8.25792142.25992142.187 | Eeles 2013
rs979200 | 8q24.21 | 127,923,720 | A/G | Intergenic |  | chr8.127823720.128723639.822 | Salinas 2008
rs12543663 | 8q24.21 | 127,924,659 | A/C | Intergenic |  | chr8.127823720.128723639.822 | Al Olama 2009
rs10086908 | 8q24.21 | 128,011,937 | T/C | Intergenic |  | chr8.127823720.128723639.822 | Al Olama 2009
rs1016343 | 8q24.21 | 128,093,297 | C/T | Intergenic |  | chr8.127823720.128723639.822 | Eeles 2008, Schumacher 2011
rs13252298 | 8q24.21 | 128,095,156 | A/G | Intergenic |  | chr8.127823720.128723639.822 | Schumacher 2011
rs1456315 | 8q24.21 | 128,103,937 | A/G | Intergenic |  | chr8.127823720.128723639.822 | Takata 2010
rs13254738 | 8q24.21 | 128,104,343 | A/C | Intergenic |  | chr8.127823720.128723639.822 | Salinas 2008, Haiman 2007, Cheng 2012
rs6983561 | 8q24.21 | 128,106,880 | A/C | Intergenic |  | chr8.127823720.128723639.822 | Salinas 2008, Haiman 2007, Cheng 2012
rs16901979 | 8q24.21 | 128,124,916 | A/C | Intergenic |  | chr8.127823720.128723639.822 | Gudmundsson 2007A, 2009, Zheng 2009
rs10505483 | 8q24.21 | 128,125,195 | A/T | Intergenic |  | chr8.127823720.128723639.822 | Cheng 2012
rs16902094 | 8q24.21 | 128,320,346 | A/G | Intergenic |  | chr8.127823720.128723639.822 | Gudmundsson 2009
rs445114 | 8q24.21 | 128,323,181 | C/T | Intergenic |  | chr8.127823720.128723639.822 | Gudmundsson 2009, Schumacher 2011
rs620861 | 8q24.21 | 128,335,673 | C/T | Intergenic |  | chr8.127823720.128723639.822 | Al Olama 2009
rs6983267 | 8q24.21 | 128,413,305 | G/T | Intergenic |  | chr8.127823720.128723639.822 | Eeles 2008, Thomas 2008, Yeager 2007, Zheng 2009, Schumacher 2011
rs7837328 | 8q24.21 | 128,423,127 | A/G | Intergenic |  | chr8.127823720.128723639.822 | Yeager 2007
rs7000448 | 8q24.21 | 128,441,170 | C/T | Intergenic |  | chr8.127823720.128723639.822 | Salinas 2008, Haiman 2007
rs1447295 | 8q24.21 | 128,485,038 | A/C | Intergenic |  | chr8.127823720.128723639.822 | Gudmundsson 2007A, 2009, Yeager 2007
rs4242382 | 8q24.21 | 128,517,573 | A/G | Intergenic |  | chr8.127823720.128723639.822 | Thomas 2008
rs4242384 | 8q24.21 | 128,518,554 | A/C | Intergenic |  | chr8.127823720.128723639.822 | Eeles 2008, 2009, Schumacher 2011
rs10090154 | 8q24.21 | 128,532,137 | C/T | Intergenic |  | chr8.127823720.128723639.822 | Salinas 2008, Haiman 2007, Cheng 2012
rs7837688 | 8q24.21 | 128,539,360 | G/T | Intergenic |  | chr8.127823720.128723639.822 | Takata 2010
rs7005795 | 8q24.21 | 128,623,639 | G/T | Intergenic |  | chr8.127823720.128723639.822 | none?
rs17694493 | 9p21 | 22,041,998 | G/C | intronic | CDKN2B-AS1 | chr9.21941998.22141998.99 | unpublished meta analysis
rs817826 | 9q31.2 | 110,156,300 | T/C | Intergenic |  | chr9.110056300.110256300.186 | Xu 2012
rs1571801 | 9q33.2 | 124,427,373 | A/C | Intragenic-intron | DAB2IP | chr9.124327373.124527373.113 | Duggan 2007
rs76934034 | 10q11 | 46,082,985 | T/C | intronic | MARCH8 | chr10.45982985.46182985.48 | unpublished meta analysis
rs3123078 | 10q11.23 | 51,524,971 | C/T | Intergenic |  | chr10.51424971.51649496.56 | Eeles 2009
rs10993994 | 10q11.23 | 51,549,496 | C/T | Intergenic |  | chr10.51424971.51649496.56 | Eeles 2008, Thomas 2008, Takata 2010, Schumacher 2011
rs3850699 | 10q24.32 | 104,414,221 | A/G | Intragenic-intron | TRIM8 | chr10.104314221.104514221.67 | Eeles 2013
rs2252004 | 10q26.12 | 122,844,709 | G/T | Intergenic |  | chr10.122744709.123132519.253 | Akamatsu 2012
rs11199874 | 10q26.12 | 123,032,519 | A/G | Intergenic |  | chr10.122744709.123132519.253 | Nam 2012
rs4962416 | 10q26.13 | 126,696,872 | C/T | Intragenic-intron | CTBP2 | chr10.126596872.126796872.265 | Thomas 2008
rs7127900 | 11p15.5 | 2,233,574 | A/G | Intergenic |  | chr11.2133574.2333574.160 | Eeles 2009
rs1938781 | 11q12 | 58,915,110 | T/C | Intragenic-intron | FAM111A | chr11.58815110.59015110.73 | Akamatsu 2012
rs12418451 | 11q13.2 | 68,935,419 | A/G | intergenic | LNCRNA RP11-554A11.8 | chr11.68835419.69095958.206 | Zheng 2009
rs11228565 | 11q13.2 | 68,978,580 | A/G | Intergenic |  | chr11.68835419.69095958.206 | Gudmundsson 2009
rs7931342 | 11q13.2 | 68,994,497 | G/T | Intergenic |  | chr11.68835419.69095958.206 | Eeles 2008
rs10896449 | 11q13.2 | 68,994,667 | A/G | Intergenic |  | chr11.68835419.69095958.206 | Thomas 2008, Zheng 2009
rs7130881 | 11q13.3 | 68,995,958 | A/G | Intergenic |  | chr11.68835419.69095958.206 | Eeles 2009, Schumacher 2011
rs11568818 | 11q22.2 | 102,401,661 | A/G | Intergenic |  | chr11.102301661.102501661.177 | Eeles 2013
rs11214775 | 11q23 | 113,807,181 | G/A | intronic | HTR3B | chr11.113707181.113907181.105 | unpublished meta analysis
rs731236 | 12q13.12 | 48,238,757 | C/T | synonymous | VDR | chr12.48138757.48519618.295 | Bonilla 2011
rs80130819 | 12q13 | 48,419,618 | A/C | 17 kb 3' | SENP1 | chr12.48138757.48519618.295 | unpublished meta analysis
rs10875943 | 12q13.12 | 49,676,010 | C/T | Intergenic |  | chr12.49576010.49776010.86 | Kote-Jarai 2011
rs902774 | 12q13.13 | 53,273,904 | A/G | Intergenic |  | chr12.53173904.53373904.144 | Schumacher 2011
rs12827748 | 12q21.31 | 80,088,578 | C/T | intergenic |  | chr12.79988578.80188578.44 | Bonilla 2011
rs1270884 | 12q24.21 | 114,685,571 | G/A | Intergenic |  | chr12.114585571.114785571.183 | Eeles 2013
rs9600079 | 13q22.1 | 73,728,139 | G/T | Intergenic |  | chr13.73628139.73828139.181 | Takata 2010
rs1529276 | 13q33.1 | 103,928,007 | A/T | Intergenic |  | chr13.103828007.104028007.161 | Murabito 2007
rs8008270 | 14q22.1 | 53,372,330 | G/A | Intragenic-intron | FERMT2 | chr14.53272330.53472330.72 | Eeles 2013
rs7153648 | 14q23 | 61,122,526 | C/G |  |  | chr14.61022526.61222526.42 | unpublished meta analysis
rs7141529 | 14q24.1 | 69,126,744 | A/G | Intragenic-intron | RAD51L1 | chr14.69026744.69226744.245 | Eeles 2013
rs8014671 | 14q24 | 71,092,256 | G/A | 16 kb 5' | TTC9 | chr14.70992256.71192256.161 | unpublished meta analysis
rs4775302 | 15q21.1 | 46,639,808 | A/G | Intergenic |  | chr15.46539808.46739808.116 | Nam 2013
rs12051443 | 16q22 | 71,691,329 | A/G |  |  | chr16.71591329.71791329.67 | unpublished meta analysis
rs684232 | 17p13.3 | 618,965 | A/G | Intergenic |  | chr17.518965.718965.117 | Eeles 2013
rs11649743 | 17q12 | 36,074,979 | A/G | Intragenic-intron | HNF1B | chr17.35974979.36201156.186 | Sun 2008, Levin 2008
rs4430796 | 17q12 | 36,098,040 | A/G | Intragenic-intron | HNF1B | chr17.35974979.36201156.186 | Thomas 2008, Gudmundsson 2007, Levin 2008, Eeles 2009, Gudmundsson 2009
rs7501939 | 17q12 | 36,101,156 | C/T | Intragenic-intron | HNF1B | chr17.35974979.36201156.186 | Eeles 2008, Sun 2008, Levin 2008, Takata 2010, Schumacher 2011
rs11650494 | 17q21.32 | 47,345,186 | G/A | Intergenic |  | chr17.47245186.47536749.149 | Eeles 2013
rs1859962 | 17q24.3 | 69,108,753 | G/T | Intergenic |  | chr17.69008753.69208753.130 | Eeles 2008, Gudmundsson 2007, Levin 2008, Eeles 2009, Schumacher 2011
rs7241993 | 18q23 | 76,773,973 | G/A | Intergenic |  | chr18.76673973.76873973.112 | Eeles 2013
rs8102476 | 19q13.2 | 38,735,613 | C/T | Intergenic |  | chr19.38635613.38835613.112 | Gudmundsson 2009
rs11672691 | 19q13.2 | 41,985,587 | G/A | Intragenic-intron | LOC100505495 | chr19.41885587.42085624.89 | Al Olama 2012, 2013
rs887391 | 19q13.2 | 41,985,624 | C/T | Intergenic |  | chr19.41885587.42085624.89 | Hsu 2009
rs2735839 | 19q13.33 | 51,364,623 | A/G | Intergenic |  | chr19.51264623.51464623.230 | Eeles 2008
rs103294 | 19q13.4 | 54,797,848 | T/C | Intragenic-intron | LILRA3 | chr19.54697848.54897848.146 | Xu 2012
rs12480328 | 20q13 | 49,527,922 | T/C |  |  | chr20.49427922.49627922.124 | unpublished meta analysis
rs2427345 | 20q13.33 | 61,015,611 | G/A | Intergenic |  | chr20.60915611.61115611.153 | Eeles 2013
rs6062509 | 20q13.33 | 62,362,563 | A/C | Intragenic-intron | ZGPAT | chr20.62262563.62462563.84 | Eeles 2013
rs1041449 | 21q22 | 42,901,421 | G/A |  |  | chr21.42801421.43001421.189 | unpublished meta analysis
rs2238776 | 22q11 | 19,757,892 | G/A |  |  | chr22.19657892.19857892.132 | unpublished meta analysis
rs11704416 | 22q13.1 | 40,436,973 | G/C | Intergenic |  | chr22.40336973.40552119.71 | Al Olama 2012, 2013
rs9623117 | 22q13.1 | 40,452,119 | C/T | Intragenic-intron | TNRC6B | chr22.40336973.40552119.71 | Sun 2009
rs5759167 | 22q13.2 | 43,500,212 | G/T | Intergenic |  | chr22.43400212.43618275.191 | Eeles 2009
rs742134 | 22q13.2 | 43,518,275 | A/G | intragenic-intron | BIK | chr22.43400212.43618275.191 | Schumacher 2011
rs2405942 | Xp22.2 | 9,814,135 | A/G | Intragenic-intron | SHROOM2 | chr23.9714135.9914135.117 | Eeles 2013
rs1327301 | Xp11.22 | 51,210,057 | C/T | Intergenic |  | chr23.51110057.51341672.32 | Eeles 2009
rs5945572 | Xp11.22 | 51,229,683 | A/G | Intergenic |  | chr23.51110057.51341672.32 | Gudmundsson 2008
rs5945619 | Xp11.22 | 51,241,672 | C/T | Intergenic |  | chr23.51110057.51341672.32 | Eeles 2008
rs2807031 | Xp11 | 52,896,949 | C/T | intronic | XAGE3 | chr23.52796949.52996949.17 | unpublished meta analysis
rs5919432 | Xq12 | 67,021,550 | C/A | Intergenic |  | chr23.66921550.67121550.26 | Kote-Jarai 2011
rs6625711 | Xq13 | 70,139,850 | A/T | 36 kb 3' | SLC7A | chr23.70039850.70239850.62 | unpublished meta analysis
rs4844289 | Xq13 | 70,407,983 | G/A | 16 kb 3' | NLGN3 | chr23.70307983.70507983.74 | unpublished meta analysis
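
The LD region identifiers in Table 1 appear to follow a regular pattern: the chromosome, then a start 100 kb below the lowest risk-SNP position in the region, then an end 100 kb above the highest, then a final number whose meaning is not stated in this section. This reading is inferred from the table rather than taken from a stated rule, and not every row can be checked this way (the region for rs11650494, for example, ends further out than its one listed position would give). The sketch below, with hypothetical function and parameter names, reproduces the first three fields under that assumption.

```python
def ld_region_prefix(chrom, positions, flank=100_000):
    """Build the 'chrN.start.end' part of a region identifier.

    Assumption inferred from Table 1: start and end lie `flank` base pairs
    outside the lowest and highest risk-SNP positions in the region. The
    trailing number in the identifiers is not reproduced here.
    """
    start = min(positions) - flank
    end = max(positions) + flank
    return f"chr{chrom}.{start}.{end}"
```

For rs12653946 at position 1,895,829 on chromosome 5 this gives chr5.1795829.1995829, and for the pair rs721048 and rs6545977 on chromosome 2 it gives chr2.63031731.63401164, both matching the table.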

Table  2:  Number  of  SNPs  and  number  of  genes  evaluated  for  each  of  the  risk  regions 

Risk SNP ID | LD region for SNP evaluation | 2 Mb ROI for gene evaluation | # SNPs evaluated (nSNPs) | # genes in ROI (total) | # genes evaluated (ngene) | # tests (Nfreq)
--- | --- | --- | --- | --- | --- | ---
rs636291 | chr1.10456097.10656097.112 | chr1.9456097.11656097.112 | 71 | 27 | 23 | 1662
rs17599629 | chr1.150558287.150758287.49 | chr1.149558287.151758287.49 | 11 | 75 | 61 | 670
rs1218582 | chr1.154734183.154934183.172 | chr1.153734183.155934183.172 | 87 | 76 | 63 | 5411
rs4245739 | chr1.204418842.204618842.68 | chr1.203418842.205618842.68 | 131 | 36 | 28 | 3683
rs1775148 | chr1.205657824.205857824.106 | chr1.204657824.206857824.106 | 40 | 31 | 27 | 1100
rs11902236 | chr2.10017868.10217868.182 | chr2.9017868.11217868.182 | 11 | 21 | 18 | 198
rs9287719 | chr2.10610730.10810730.147 | chr2.9610730.11810730.147 | 269 | 26 | 19 | 4667
rs13385191 | chr2.20788265.20988265.160 | chr2.19788265.21988265.160 | 11 | 13 | 12 | 132
rs1465618 | chr2.43453949.43653949.82 | chr2.42453949.44653949.82 | 8 | 19 | 17 | 136
rs721048, rs6545977 | chr2.63031731.63401164.88 | chr2.62031731.64401164.88 | 41 | 14 | 13 | 460
rs2028898, rs10187424 | chr2.85677270.85894297.100 | chr2.84677270.86894297.100 | 111 | 37 | 34 | 3727
rs12621278 | chr2.173211553.173411553.175 | chr2.172211553.174411553.175 | 106 | 16 | 13 | 1354
rs7584330, rs2292884 | chr2.238287228.238543226.217 | chr2.237287228.239543226.217 | 258 | 23 | 19 | 4902
rs3771570 | chr2.242282864.242482864.97 | chr2.241282864.243482864.97 | 69 | 35 | 27 | 1863
rs9311171 | chr3.37896477.38096477.85 | chr3.36896477.39096477.85 | 27 | 25 | 21 | 544
rs2660753, rs9284813, rs17181170, rs7629490 | chr3.87010674.87341497.136 | chr3.86010674.88341497.136 | 165 | 9 | 7 | 1126
rs2055109 | chr3.87367332.87567332.101 | chr3.86367332.88567332.101 | 87 | 8 | 6 | 522
rs7611694 | chr3.113175624.113375624.105 | chr3.112175624.114375624.105 | 24 | 26 | 23 | 552
rs10934853 | chr3.127938373.128138373.49 | chr3.126938373.129138373.49 | 44 | 33 | 27 | 1192
rs6763931 | chr3.141002833.141202833.103 | chr3.140002833.142202833.103 | 51 | 14 | 11 | 559
rs345013 | chr3.145073788.145273788.65 | chr3.144073788.146273788.65 | 89 | 4 | 4 | 324
rs10936632 | chr3.170030102.170230102.97 | chr3.169030102.171230102.97 | 30 | 21 | 16 | 480

rs10009409

chr4.73755253.73955253.78

chr4.72755253.74955253.78